• Statistical and Algorithmic Foundation and Insight of Deep Learning

• On Unified Framework of Deep Generative Models

• Computational Mechanisms: Distributed Deep Learning Architectures

Part-I








Outline 


• Probabilistic Graphical Models: Basics
• An overview of DL components
  • Historical remarks: early days of neural networks
  • Modern building blocks: units, layers, activation functions, loss functions, etc.
  • Reverse-mode automatic differentiation (aka backpropagation)
• Similarities and differences between GMs and NNs
  • Graphical models vs. computational graphs
  • Sigmoid Belief Networks as graphical models
  • Deep Belief Networks and Boltzmann Machines
• Combining DL methods and GMs
  • Using outputs of NNs as inputs to GMs
  • GMs with potential functions represented by NNs
  • NNs with structured outputs
• Bayesian Learning of NNs
  • Bayesian learning of NN parameters
  • Deep kernel learning
Fundamental questions of probabilistic modeling

• Representation: what is the joint probability distribution on multiple variables, $P(X_1, X_2, X_3, \ldots, X_8)$?
  • How many state configurations are there?
  • Do they all need to be represented?
  • Can we incorporate any domain-specific insights into the representation?

• Learning: where do we get the probabilities from?
  • Maximum likelihood estimation? How much data do we need?
  • Are there any other established principles?

• Inference: if not all variables are observable, how to compute the conditional distribution of latent variables given evidence?
  • Computing $P(H \mid A)$ would require summing over $2^6$ configurations of the unobserved variables
What is a graphical model?

• A possible world of cellular signal transduction

GM: structure simplifies representation

• A possible world of cellular signal transduction

[Figure: signaling pathway across the cell membrane]
Probabilistic Graphical Models

• If the $X_i$'s are conditionally independent (as described by a PGM), then the joint can be factored into a product of simpler terms:

$$P(X_1, X_2, X_3, X_4, X_5, X_6, X_7, X_8) = P(X_1)\,P(X_2)\,P(X_3 \mid X_1)\,P(X_4 \mid X_2)\,P(X_5 \mid X_2)\,P(X_6 \mid X_3, X_4)\,P(X_7 \mid X_6)\,P(X_8 \mid X_5, X_6)$$

[Figure: signaling network whose nodes include Receptor A ($X_1$), Receptor B ($X_2$), Gene G ($X_7$), and Gene H ($X_8$)]

• Why may we favor a PGM?
  • Easy to incorporate domain knowledge and causal (logical) structures
  • Significant reduction in representation cost ($2^8$ reduced down to 18)
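The reduction from $2^8$ to 18 can be checked by counting table entries. The sketch below assumes every variable is binary and stores one free parameter per parent configuration; the `parents` dictionary is read off the factorization above, and the function names are illustrative only.

```python
# Parents of each binary variable, read off the factorization
# P(X1)P(X2)P(X3|X1)P(X4|X2)P(X5|X2)P(X6|X3,X4)P(X7|X6)P(X8|X5,X6).
parents = {
    "X1": [], "X2": [],
    "X3": ["X1"], "X4": ["X2"], "X5": ["X2"],
    "X6": ["X3", "X4"], "X7": ["X6"], "X8": ["X5", "X6"],
}

def full_joint_size(n_vars):
    """Number of entries in an unstructured joint table over binary variables."""
    return 2 ** n_vars

def factored_size(parents):
    """One parameter P(X=1 | parent configuration) per parent configuration."""
    return sum(2 ** len(pa) for pa in parents.values())

print(full_joint_size(len(parents)))  # 256
print(factored_size(parents))         # 18
```

The factored count grows with the largest parent set rather than with the total number of variables, which is where the saving comes from.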
© Petuum,Inc. 10 


The two types of GMs

• Directed edges assign causal meaning to the relationships (Bayesian Networks or Directed Graphical Models)

$$P(X_1, X_2, X_3, X_4, X_5, X_6, X_7, X_8) = P(X_1)\,P(X_2)\,P(X_3 \mid X_1)\,P(X_4 \mid X_2)\,P(X_5 \mid X_2)\,P(X_6 \mid X_3, X_4)\,P(X_7 \mid X_6)\,P(X_8 \mid X_5, X_6)$$

• Undirected edges represent correlations between the variables (Markov Random Field or Undirected Graphical Models)

$$P(X_1, X_2, X_3, X_4, X_5, X_6, X_7, X_8) = \frac{1}{Z} \exp\{E(X_1) + E(X_2) + E(X_1, X_3) + E(X_2, X_4) + E(X_5, X_2) + E(X_3, X_4, X_6) + E(X_6, X_7) + E(X_5, X_6, X_8)\}$$
Outline

• An overview of DL components
  • Historical remarks: early days of neural networks
  • Modern building blocks: units, layers, activation functions, loss functions, etc.
  • Reverse-mode automatic differentiation (aka backpropagation)

Perceptron and Neural Nets 


• From biological neuron to artificial neuron (perceptron)

[Figure: McCulloch & Pitts (1943) neuron: the inputs feed a linear combiner with a threshold, followed by a hard limiter that produces the output signal]

• From biological neuron network to artificial neuron networks


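The perceptron learning algorithm

• Recall the nice property of the sigmoid function: $\frac{d\sigma}{dt} = \sigma(1 - \sigma)$
• Consider the regression problem $f: X \to Y$, for scalar $Y$: $y = f(x) + \epsilon$
• We used to maximize the conditional data likelihood:

$$w = \arg\max_w \ln \prod_i P(y_i \mid x_i; w)$$

• Here ...

$$w = \arg\min_w \sum_i \tfrac{1}{2}\,\big(y_i - f(x_i; w)\big)^2$$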
The perceptron learning algorithm

$x_d$ = input, $t_d$ = target output, $o_d$ = observed output, $w_i$ = weight $i$

Batch mode:

Do until converge:

1. Compute gradient $\nabla E_D[w]$
2. $w = w - \eta \nabla E_D[w]$

where $\nabla E_D[w] = -\sum_{d \in D} (t_d - o_d)\, o_d (1 - o_d)\, x_d$

Incremental mode:

Do until converge:

For each training example $d$ in $D$:

1. Compute gradient $\nabla E_d[w]$
2. $w = w - \eta \nabla E_d[w]$

where $\nabla E_d[w] = -(t_d - o_d)\, o_d (1 - o_d)\, x_d$
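The two update schedules can be compared on a toy problem. This is a minimal sketch for a single sigmoid unit with the squared error $E_d[w] = \tfrac{1}{2}(t_d - o_d)^2$; the OR data set, the learning rate, and the iteration count are assumptions chosen for illustration, not values from the slides.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def output(w, x):
    """Observed output o_d of a single sigmoid unit."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

def grad_example(w, x, t):
    """Gradient of E_d[w] = 0.5 * (t_d - o_d)^2 for one training example."""
    o = output(w, x)
    return [-(t - o) * o * (1.0 - o) * xi for xi in x]

def batch_step(w, data, eta):
    """Batch mode: sum the per-example gradients, then update once."""
    g = [0.0] * len(w)
    for x, t in data:
        g = [gi + gd for gi, gd in zip(g, grad_example(w, x, t))]
    return [wi - eta * gi for wi, gi in zip(w, g)]

def incremental_epoch(w, data, eta):
    """Incremental mode: update after every training example."""
    for x, t in data:
        w = [wi - eta * gd for wi, gd in zip(w, grad_example(w, x, t))]
    return w

def loss(w, data):
    return 0.5 * sum((t - output(w, x)) ** 2 for x, t in data)

# Hypothetical toy data: logical OR, with a constant 1.0 as the bias input.
data = [([1.0, 0.0, 0.0], 0.0), ([1.0, 0.0, 1.0], 1.0),
        ([1.0, 1.0, 0.0], 1.0), ([1.0, 1.0, 1.0], 1.0)]

w_batch = [0.0, 0.0, 0.0]
w_incr = [0.0, 0.0, 0.0]
for _ in range(2000):
    w_batch = batch_step(w_batch, data, eta=0.5)
    w_incr = incremental_epoch(w_incr, data, eta=0.5)
```

Both schedules follow the same per-example gradient; they differ only in how often the weights move, so on a small separable problem like this one they end up with similar weights.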
Neural Network Model

“Combined logistic models”

[Figure: inputs feeding hidden logistic units that produce the output]

Not really, no target for hidden units...
Backpropagation:
Reverse-mode differentiation

• Artificial neural networks are nothing more than complex functional compositions that can be represented by computation graphs:

[Figure: computation graph from input variables $x$ through intermediate computations to outputs $f_n$]

• By applying the chain rule and using reverse accumulation, we get

$$\frac{\partial f_n}{\partial x} = \sum_{i_1 \in \pi(n)} \frac{\partial f_n}{\partial f_{i_1}} \frac{\partial f_{i_1}}{\partial x} = \sum_{i_1 \in \pi(n)} \frac{\partial f_n}{\partial f_{i_1}} \sum_{i_2 \in \pi(i_1)} \frac{\partial f_{i_1}}{\partial f_{i_2}} \frac{\partial f_{i_2}}{\partial x} = \ldots$$

• The algorithm is commonly known as backpropagation
• What if some of the functions are stochastic?
  • Then use stochastic backpropagation! (to be covered in the next part)
• Modern packages can do this automatically (more later)
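Reverse accumulation on a computation graph can be sketched in a few lines. The `Node` class and `backward` function below are hypothetical names for a toy implementation supporting only addition, multiplication, and the sigmoid; real packages are far more general.

```python
import math

class Node:
    """A value in a computation graph that remembers how it was computed."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # pairs (parent_node, local_derivative)
        self.grad = 0.0

    def __add__(self, other):
        return Node(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Node(self.value * other.value,
                    ((self, other.value), (other, self.value)))

def sigmoid(a):
    s = 1.0 / (1.0 + math.exp(-a.value))
    return Node(s, ((a, s * (1.0 - s)),))

def backward(output):
    """Reverse accumulation: visit nodes in reverse topological order."""
    order, seen = [], set()
    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)
    visit(output)
    output.grad = 1.0
    for node in reversed(order):
        for parent, local in node.parents:
            # Chain rule: accumulate d(output)/d(parent) over all paths.
            parent.grad += node.grad * local

x, w, b = Node(2.0), Node(-0.5), Node(0.3)
y = sigmoid(w * x + b)
backward(y)
```

After `backward(y)`, `w.grad`, `x.grad`, and `b.grad` hold the partial derivatives of `y`, each obtained from a single backward sweep over the graph.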


Modern building blocks of deep networks

• Activation functions, applied as $f(Wx + b)$
  • Linear and ReLU
  • Sigmoid and tanh
  • Etc.

• Layers
  • Fully connected
  • Convolutional & pooling
  • Recurrent
  • ResNets (blocks with residual connections)
  • Etc.

• Loss functions
  • Cross-entropy loss
  • Mean squared error
  • Etc.

Putting things together: loss, activation, and concatenation blocks combined in one network

[Figure: a part of GoogLeNet]
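The three kinds of building blocks compose into a forward pass. The sketch below chains a fully connected layer, a ReLU activation, a second fully connected layer, and a softmax cross-entropy loss; the weights, the input, and the label are made-up numbers for illustration.

```python
import math

def dense(W, b, x):
    """Fully connected layer: returns Wx + b."""
    return [sum(wij * xj for wij, xj in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def relu(z):
    return [max(0.0, zi) for zi in z]

def softmax(z):
    m = max(z)                       # subtract the max for numerical stability
    e = [math.exp(zi - m) for zi in z]
    s = sum(e)
    return [ei / s for ei in e]

def cross_entropy(probs, label):
    return -math.log(probs[label])

# Hypothetical weights for a 3 -> 2 -> 2 network.
W1, b1 = [[1.0, -1.0, 0.0], [0.5, 0.5, -1.0]], [0.0, 0.1]
W2, b2 = [[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]

x, label = [2.0, 1.0, 0.5], 1
h = relu(dense(W1, b1, x))           # f(Wx + b) with f = ReLU
probs = softmax(dense(W2, b2, h))
loss = cross_entropy(probs, label)
```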
Modern building blocks of deep networks

[Figure: a part of GoogLeNet]

• Arbitrary combinations of the basic building blocks
• Multiple loss functions: multi-target prediction, transfer learning, and more
• Given enough data, deeper architectures just keep improving
• Representation learning: the networks learn increasingly more abstract representations of the data that are “disentangled,” i.e., amenable to linear separation.
• Similarities and differences between GMs and NNs
  • Graphical models vs. computational graphs
  • Sigmoid Belief Networks as graphical models
  • Deep Belief Networks and Boltzmann Machines
Graphical models vs. Deep nets

Graphical models

• Representation for encoding meaningful knowledge and the associated uncertainty in a graphical form
• Learning and inference are based on a rich toolbox of well-studied (structure-dependent) techniques (e.g., EM, message passing, VI, MCMC, etc.)
• Graphs represent models

[Figure: a topic model with topic proportions and topic assignments]

Deep neural networks

• Learn representations that facilitate computation and performance on the end-metric (intermediate representations are not guaranteed to be meaningful)
• Learning is predominantly based on the gradient descent method (aka backpropagation); inference is often trivial and done via a “forward pass”
• Graphs represent computation

[Figure: a convolutional network: input layer, alternating convolution and sub-sampling layers with their feature maps, and a fully connected MLP]
Graphical models vs. Deep nets

Graphical models

Utility of the graph
• A vehicle for synthesizing a global loss function from local structure
  • Potential function, feature function, etc.
• A vehicle for designing sound and efficient inference algorithms
  • Sum-product, mean-field, etc.
• A vehicle to inspire approximation and penalization
  • Structured MF, tree-approximation, etc.
• A vehicle for monitoring theoretical and empirical behavior and accuracy of inference

Utility of the loss function
• A major measure of quality of the learning algorithm and the model: $\theta = \arg\max_\theta P_\theta(V)$

Deep neural networks

Utility of the network
• A vehicle to conceptually synthesize complex decision hypothesis
  • Stage-wise projection and aggregation
• A vehicle for organizing computational operations
  • Stage-wise update of latent states
• A vehicle for designing processing steps and computing modules
  • Layer-wise parallelization
• No obvious utility in evaluating DL inference algorithms

Utility of the loss function
• Global loss? Well, it is complex and non-convex...

Images from Distill.pub
|                 | DL | ML (e.g., GM) |
|-----------------|----|---------------|
| Empirical goal: | e.g., classification, feature learning | e.g., latent variable inference, transfer learning |
| Structure:      | Graphical | Graphical |
| Objective:      | Something aggregated from local functions | Something aggregated from local functions |
| Vocabulary:     | Neuron, activation function, ... | Variable, potential function, ... |
| Algorithm:      | A single, unchallenged, inference algorithm: Backpropagation (BP) | A major focus of open research, many algorithms, and more to come |
| Evaluation:     | On a black-box score: end performance | On almost every intermediate quantity |
| Implementation: | Many tricks | More or less standardized |
| Experiments:    | Massive, real data (GT unknown) | Modest, often simulated data (GT known) |
Graphical Models vs. Deep Nets

• So far:
  • Graphical models are representations of probability distributions
  • Neural networks are function approximators (with no probabilistic meaning)

• Some of the neural nets are in fact proper graphical models (i.e., units/neurons represent random variables):
  • Boltzmann machines (Hinton & Sejnowski, 1983)
  • Restricted Boltzmann machines (Smolensky, 1986)
  • Learning and inference in sigmoid belief networks (Neal, 1992)
  • Fast learning in deep belief networks (Hinton, Osindero, Teh, 2006)
  • Deep Boltzmann machines (Salakhutdinov and Hinton, 2009)

• Let's go through these models one-by-one
I: Restricted Boltzmann Machines

• RBM is a Markov random field represented with a bi-partite graph
• All nodes in one layer/part of the graph are connected to all in the other; no intra-layer connections

[Figure: bipartite graph of visible units $v_i$ and hidden units $h_j$; weight: $w_{ij}$; factor: $\exp(v_i w_{ij} h_j)$]

• Joint distribution:

$$P(v, h) = \frac{1}{Z} \exp\Big\{\sum_{i,j} w_{ij} v_i h_j + \sum_i b_i v_i + \sum_j c_j h_j\Big\}$$
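I: Restricted Boltzmann Machines

• Log-likelihood of a single data point (unobservables marginalized out):

$$\log L(v) = \log \sum_h \exp\Big\{\sum_{i,j} w_{ij} v_i h_j + \sum_i b_i v_i + \sum_j c_j h_j\Big\} - \log Z$$

• Gradient of the log-likelihood w.r.t. the model parameters:

$$\frac{\partial}{\partial w_{ij}} \log L(v) = \sum_h P(h \mid v)\, \frac{\partial}{\partial w_{ij}} \log \tilde{P}(v, h) - \sum_{v, h} P(v, h)\, \frac{\partial}{\partial w_{ij}} \log \tilde{P}(v, h)$$

• where $\tilde{P}(v, h) = Z \cdot P(v, h)$ is the unnormalized joint (so $\partial \log \tilde{P} / \partial w_{ij} = v_i h_j$), and we have averaging over the posterior and over the joint.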
I: Restricted Boltzmann Machines

• Gradient of the log-likelihood w.r.t. the parameters (alternative form, with $\tilde{P}$ the unnormalized joint):

$$\frac{\partial}{\partial w_{ij}} \log L(v) = E_{P(h \mid v)}\Big[\frac{\partial}{\partial w_{ij}} \log \tilde{P}(v, h)\Big] - E_{P(v, h)}\Big[\frac{\partial}{\partial w_{ij}} \log \tilde{P}(v, h)\Big]$$

• Both expectations can be approximated via sampling
  • Sampling from the posterior is exact (RBM factorizes over h given v)
  • Sampling from the joint is done via MCMC (e.g., Gibbs sampling)

• In the neural networks literature:
  • Computing the first term is called the clamped / wake / positive phase (the network is “awake” since it conditions on the visible variables)
  • Computing the second term is called the unclamped / sleep / free / negative phase (the network is “asleep” since it samples the visible variables from the joint; metaphorically, it is “dreaming” the visible inputs)

I: Restricted Boltzmann Machines

• Learning is done by optimizing the log-likelihood of the model for the given data via stochastic gradient descent (SGD)
• Estimation of the second term (the negative phase) heavily relies on the mixing properties of the Markov chain
• This often causes slow convergence and requires extra computation
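For a very small RBM both phases can be computed exactly by enumeration, which makes the gradient formula easy to sanity-check. The sketch below assumes binary units in {0, 1}; the sizes and parameter values are made up, and exact enumeration replaces the sampling that a realistic model would need.

```python
import itertools
import math

def neg_energy(v, h, W, b, c):
    """Exponent of the joint: sum_ij w_ij v_i h_j + sum_i b_i v_i + sum_j c_j h_j."""
    return (sum(W[i][j] * v[i] * h[j] for i in range(len(v)) for j in range(len(h)))
            + sum(bi * vi for bi, vi in zip(b, v))
            + sum(cj * hj for cj, hj in zip(c, h)))

def states(n):
    return list(itertools.product([0, 1], repeat=n))

def log_likelihood(v, W, b, c):
    """log L(v) with the hidden units marginalized out."""
    nv, nh = len(b), len(c)
    num = sum(math.exp(neg_energy(v, h, W, b, c)) for h in states(nh))
    Z = sum(math.exp(neg_energy(u, h, W, b, c))
            for u in states(nv) for h in states(nh))
    return math.log(num) - math.log(Z)

def grad_w(v, W, b, c, i, j):
    """Positive phase E_{P(h|v)}[v_i h_j] minus negative phase E_{P(v,h)}[v_i h_j]."""
    nv, nh = len(b), len(c)
    # Positive (clamped) phase: the posterior factorizes over hidden units.
    p_hj = 1.0 / (1.0 + math.exp(-(c[j] + sum(W[k][j] * v[k] for k in range(nv)))))
    positive = v[i] * p_hj
    # Negative (free) phase: exact expectation under the joint by enumeration.
    weights = {(u, h): math.exp(neg_energy(u, h, W, b, c))
               for u in states(nv) for h in states(nh)}
    Z = sum(weights.values())
    negative = sum(p * u[i] * h[j] for (u, h), p in weights.items()) / Z
    return positive - negative

# Hypothetical parameters: 3 visible units, 2 hidden units.
W = [[0.5, -0.3], [0.2, 0.8], [-0.6, 0.1]]
b = [0.1, -0.2, 0.0]
c = [0.3, -0.1]
v = (1, 0, 1)
g = grad_w(v, W, b, c, 0, 1)
```

The enumeration over all joint states is what becomes intractable at scale, and it is exactly the part that Gibbs sampling approximates in the negative phase.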
II: Sigmoid Belief Networks

[Figure: layered belief networks over diseases, hidden units, and symptoms, from Neal, 1992]

• Sigmoid belief nets are simply Bayesian networks over binary variables with conditional probabilities represented by sigmoid functions:

$$P(x_i \mid \pi(x_i)) = \sigma\Big(x_i \sum_{x_j \in \pi(x_i)} w_{ij} x_j\Big)$$

• Bayesian networks exhibit a phenomenon called the “explain away effect”

$A \to C \leftarrow B$: If A correlates with C, then the chance of B correlating with C decreases. ⇒ A and B become correlated given C.

Note: Due to the “explain away effect,” when we condition on the visible layer in belief networks, hidden variables all become dependent.
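The explain-away effect can be seen numerically on the three-node network $A \to C \leftarrow B$. The probabilities in this sketch are assumed values chosen for illustration; inference is done by brute-force enumeration.

```python
import itertools

P_A, P_B = 0.1, 0.1   # hypothetical priors of two independent causes

def p_c_given(a, b):
    """Assumed conditional table: C is likely if either cause is present."""
    return 0.95 if (a or b) else 0.01

def joint(a, b, c):
    pa = P_A if a else 1 - P_A
    pb = P_B if b else 1 - P_B
    pc = p_c_given(a, b) if c else 1 - p_c_given(a, b)
    return pa * pb * pc

def prob_a(evidence):
    """P(A=1 | evidence) by enumeration; evidence maps variable names to values."""
    num = den = 0.0
    for a, b, c in itertools.product([0, 1], repeat=3):
        world = {"A": a, "B": b, "C": c}
        if all(world[k] == val for k, val in evidence.items()):
            p = joint(a, b, c)
            den += p
            num += p * a
    return num / den

p_prior = prob_a({})                     # belief in A before seeing anything
p_given_c = prob_a({"C": 1})             # observing C raises the belief in A
p_given_c_b = prob_a({"C": 1, "B": 1})   # B explains C, so the belief in A drops
```

A and B are independent a priori, yet once C is observed, learning B changes the belief in A: the two causes have become dependent given their common effect.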





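Sigmoid Belief Networks:
Learning and Inference

• Neal proposed Monte Carlo methods for learning and inference (Neal, 1992):

$$\frac{\partial L}{\partial w_{ij}} = \sum_{v \in T} \frac{1}{P(V = v)} \frac{\partial P(V = v)}{\partial w_{ij}} \quad \text{(log derivative)}$$

$$= \sum_{v \in T} \frac{1}{P(V = v)} \sum_h \frac{\partial P(S = (h, v))}{\partial w_{ij}} \quad \text{(prob. of the visibles via marginalization)}$$

$$= \sum_{v \in T} \sum_h P(S = (h, v) \mid V = v)\, \frac{1}{P(S = (h, v))} \frac{\partial P(S = (h, v))}{\partial w_{ij}} \quad \text{(Bayes rule + rearrange sums)}$$

$$= \sum_{v \in T} \sum_s P(S = s \mid V = v)\, \frac{1}{P(S = s)} \frac{\partial P(S = s)}{\partial w_{ij}}$$

$$= \sum_{v \in T} \sum_s P(S = s \mid V = v)\, \frac{1}{\sigma\big(s_i \sum_{k < i} s_k w_{ik}\big)} \frac{\partial\, \sigma\big(s_i \sum_{k < i} s_k w_{ik}\big)}{\partial w_{ij}} \quad \text{(plug in the actual sigmoid form of the conditional prob.)}$$

• Approximated with Gibbs sampling
• Conditional distributions:

$$P(S_i = x \mid S_j = s_j : j \neq i) \propto \sigma\Big(x \sum_{j < i} s_j w_{ij}\Big) \prod_{j > i} \sigma\Big(s_j \Big(x\, w_{ji} + \sum_{k < j,\, k \neq i} s_k w_{jk}\Big)\Big)$$

• No negative phase as in RBM!
• Convergence is very slow, especially for large belief nets, due to the intricate “explain-away” effects...

Equations from Neal, 1992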
RBMs are infinite belief networks

• Recall the expression for the gradient of the log-likelihood for an RBM (with $\tilde{P}$ the unnormalized joint):

$$\frac{\partial}{\partial w_{ij}} \log L(v) = E_{P(h \mid v)}\Big[\frac{\partial}{\partial w_{ij}} \log \tilde{P}(v, h)\Big] - E_{P(v, h)}\Big[\frac{\partial}{\partial w_{ij}} \log \tilde{P}(v, h)\Big]$$

• To make a gradient update of the model parameters, we need to compute the expectations via sampling.
  • We can sample exactly from the posterior in the first term
  • We run block Gibbs sampling to approximately sample from the joint distribution

[Figure: hidden and visible layers alternating across sampling steps]

RBMs are infinite belief networks

• Gibbs sampling: alternate between sampling hidden and visible variables

[Figure: hidden and visible layers alternating across sampling steps]

• Conditional distributions $P(v \mid h)$ and $P(h \mid v)$ are represented by sigmoids
• Thus, we can think of Gibbs sampling from the joint distribution represented by an RBM as a top-down propagation in an infinitely deep sigmoid belief network!
RBMs are infinite belief networks

• RBMs are equivalent to infinitely deep belief networks

[Figure: an RBM over the visible layer next to an infinitely deep belief network that generates the visible layer top-down, "and so on..."]

• Sampling from this is the same as sampling from the network on the right
• When we train an RBM, we are really training an infinitely deep belief net!
  • It is just that the weights of all layers are tied.
• If the weights are “untied” to some extent, we get a Deep Belief Network.
III: Deep Belief Nets

[Figure: Deep Belief Network]

• DBNs are hybrid graphical models (chain graphs):
  • Exact inference in DBNs is problematic due to the explaining away effect
  • Training: greedy pre-training + ad-hoc fine-tuning; no proper joint training
  • Approximate inference is feed-forward

Deep Belief Networks

[Figure: Deep Belief Network]

• DBNs represent a joint probability distribution

$$P(v, h^1, h^2, h^3) = P(h^2, h^3)\, P(h^1 \mid h^2)\, P(v \mid h^1)$$

• Note that $P(h^2, h^3)$ is an RBM and the conditionals $P(h^1 \mid h^2)$ and $P(v \mid h^1)$ are represented in the sigmoid form
• The model is trained by optimizing the log-likelihood for the given data, $\log P(v)$

Challenges:
• Exact inference in DBNs is problematic due to the explaining away effect
• Training is done in two stages:
  • greedy pre-training + ad-hoc fine-tuning; no proper joint training
• Approximate inference is feed-forward (bottom-up)


DBN: Layer-wise pre-training

• Pre-train and freeze the 1st RBM
• Stack another RBM on top and train it
• The weights of the 2+ layers remain tied
• We repeat this procedure: pre-train and untie the weights layer-by-layer...

[Figure: stack of layers above the visible layer, "and so on..."; the first-layer weights W are frozen and “untied” from the tied weights above]

DBN: Layer-wise pre-training

• We repeat this procedure: pre-train and untie the weights layer-by-layer:
• The weights of the 3+ layers remain tied
• From the optimization perspective, this procedure loosely corresponds to an approximate block-coordinate ascent on the log-likelihood

images from Marcus Frean, MLSS Tutorial 2010
DBN: Fine-tuning

• Pre-training is quite ad-hoc and is unlikely to lead to a good probabilistic model per se
• However, the layers of representations could perhaps be useful for some other downstream tasks!
• We can further “fine-tune” a pre-trained DBN for some other task

Setting A: Unsupervised learning (DBN → autoencoder)

1. Pre-train a stack of RBMs in a greedy layer-wise fashion
2. “Unroll” the RBMs to create an autoencoder
3. Fine-tune the parameters by optimizing the reconstruction error

[Figure: the pretraining, unrolling, and fine-tuning stages; images from Hinton & Salakhutdinov, 2006]


DBN: Fine-tuning

• Pre-training is quite ad-hoc and is unlikely to lead to a good probabilistic model per se
• However, the layers of representations could perhaps be useful for some other downstream tasks!
• We can further “fine-tune” a pre-trained DBN for some other task

Setting B: Supervised learning (DBN → classifier)

1. Pre-train a stack of RBMs in a greedy layer-wise fashion
2. “Unroll” the RBMs to create a feedforward classifier
3. Fine-tune the parameters by optimizing the classification error

Some intuitions about how pre-training works:
Erhan et al.: Why Does Unsupervised Pre-training Help Deep Learning? JMLR, 2010
Deep Belief Nets and Boltzmann Machines

Deep Belief Network

• DBNs are hybrid graphical models (chain graphs):
  • Inference in DBNs is problematic due to the explaining away effect
  • Training: greedy pre-training + ad-hoc fine-tuning; no proper joint training
  • Approximate inference is feed-forward

Deep Boltzmann Machine

• DBMs are fully undirected models (Markov random fields):
  • Can be trained similarly as RBMs via MCMC (Hinton & Sejnowski, 1983)
  • Use a variational approximation of the data distribution for faster training (Salakhutdinov & Hinton, 2009)
  • Similarly, can be used to initialize other networks for downstream tasks


Graphical models vs. Deep networks

• A few critical points to note about all these models:
  • The primary goal of deep generative models is to represent the distribution of the observable variables. Adding layers of hidden variables allows representing increasingly more complex distributions.
  • Hidden variables are secondary (auxiliary) elements used to facilitate learning of complex dependencies between the observables.
  • Training of the model is ad-hoc, but what matters is the quality of learned hidden representations.
  • Representations are judged by their usefulness on a downstream task (the probabilistic meaning of the model is often discarded at the end).

• In contrast, classical graphical models are often concerned with the correctness of learning and inference of all variables
An old study of belief networks from the GM standpoint [Xing, Russell, Jordan, UAI 2003]

Mean-field partitions of a sigmoid belief network for subsequent GMF inference

Study focused only on inference/learning accuracy, speed, and partition

[Figure: singleton marginal error and CPU time, each shown with and without observations]
“Optimize” how to optimize via truncation & re-opt

• Energy-based modeling of the structured output (CRF)

$$y^*(x; w) := \arg\min_y E(y, x; w)$$

• Unroll the optimization algorithm for a fixed number of steps (Domke, 2012)

$$y^*(x; w) = \text{opt-alg}_y\, E(y, x; w)$$

We can backprop through the optimization steps since they are just a sequence of computations

Dealing with structured prediction

• Energy-based modeling of the structured output (CRF)

$$y^*(x; w) := \arg\min_y E(y, x; w)$$

• Unroll the optimization algorithm for a fixed number of steps (Domke, 2012)

$$y^*(x; w) = \text{opt-alg}_y\, E(y, x; w)$$

• We can think of $y^*$ as some non-linear differentiable function of the inputs and weights → impose some loss and optimize it as any other standard computation graph using backprop!
• Similarly, message passing based inference algorithms can be truncated and converted into computational graphs (Domke, 2011; Stoyanov et al., 2011)
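Truncation and re-optimization can be illustrated on a scalar toy problem. The quadratic energy, step size, and number of unrolled steps below are assumptions for illustration; because everything is scalar, the derivative of the truncated output is carried along with the iterate by the chain rule, which gives the same value backprop would.

```python
def unrolled_inference(x, w, steps=5, alpha=0.3, lam=0.5):
    """Run a fixed number of gradient steps on
    E(y, x; w) = 0.5*(y - w*x)**2 + 0.5*lam*y**2.

    Returns the truncated minimizer y_T and its derivative dy_T/dw, obtained by
    differentiating every update step with the chain rule.
    """
    y, dy_dw = 0.0, 0.0
    for _ in range(steps):
        grad_y = (y - w * x) + lam * y            # dE/dy at the current iterate
        # Differentiate the update y <- y - alpha * grad_y with respect to w.
        dy_dw = dy_dw - alpha * ((dy_dw - x) + lam * dy_dw)
        y = y - alpha * grad_y
    return y, dy_dw

def train(x, target, w=0.0, lr=0.5, iters=200):
    """Fit w so that the truncated optimizer output matches the target."""
    for _ in range(iters):
        y, dy_dw = unrolled_inference(x, w)
        w -= lr * (y - target) * dy_dw            # gradient of 0.5*(y - target)**2
    return w

w_fit = train(x=1.0, target=2.0)
y_fit, _ = unrolled_inference(1.0, w_fit)
```

Note that `w_fit` is tuned for the five-step optimizer, not for the exact minimizer of the energy: the loss is imposed on what the truncated procedure actually outputs.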
Outline

• Combining DL methods and GMs
  • Using outputs of NNs as inputs to GMs
  • GMs with potential functions represented by NNs
  • NNs with structured outputs
Combining sequential NNs and GMs

Hybrid: RNN + HMM

[Figure: RNN + HMM hybrid architecture]

The model, inference, and learning can be analogous to our NN + HMM hybrid

• Objective: log-likelihood
• Model: HMM/Gaussian emissions
• Inference: forward-backward algorithm
• Learning: SGD with gradient by backpropagation

slide courtesy: Matt Gormley
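The log-likelihood objective of an HMM with Gaussian emissions is computed by the forward pass of forward-backward. The sketch below uses fixed, made-up emission means; in the hybrid these would be produced by the neural network, which is not modeled here. A brute-force sum over state paths is included only as a check for tiny sequences.

```python
import itertools
import math

def gauss(y, mean, std):
    return math.exp(-0.5 * ((y - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def forward_loglik(obs, init, trans, means, stds):
    """Log-likelihood of a sequence under an HMM with Gaussian emissions.

    Uses the forward recursion with per-step normalization to avoid underflow.
    """
    K = len(init)
    alpha = [init[k] * gauss(obs[0], means[k], stds[k]) for k in range(K)]
    loglik = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [gauss(obs[t], means[k], stds[k])
                     * sum(alpha[j] * trans[j][k] for j in range(K))
                     for k in range(K)]
        norm = sum(alpha)
        loglik += math.log(norm)
        alpha = [a / norm for a in alpha]
    return loglik

def brute_force_loglik(obs, init, trans, means, stds):
    """Sum over all hidden state paths; only feasible for tiny examples."""
    K, total = len(init), 0.0
    for path in itertools.product(range(K), repeat=len(obs)):
        p = init[path[0]] * gauss(obs[0], means[path[0]], stds[path[0]])
        for t in range(1, len(obs)):
            p *= (trans[path[t - 1]][path[t]]
                  * gauss(obs[t], means[path[t]], stds[path[t]]))
        total += p
    return math.log(total)

# Hypothetical two-state model and a short observation sequence.
init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.2, 0.8]]
means, stds = [0.0, 3.0], [1.0, 1.0]
obs = [0.1, 0.4, 2.9, 3.2]
ll = forward_loglik(obs, init, trans, means, stds)
```

Every operation in `forward_loglik` is differentiable in the emission parameters, which is what allows the gradient of the log-likelihood to flow back into a network that produces them.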
Hybrid NNs + conditional GMs

• In a standard CRF, each of the factor cells is a parameter.
• In a hybrid model, these values are computed by a neural network.

slide courtesy: Matt Gormley

Hybrid NNs + conditional GMs

Hybrid: Neural Net + CRF

[Figure: forward computation]

Hybrid NNs + conditional GMs

Hybrid: CNN + CRF

• “NN + SLL”
  • Model: Convolutional Neural Network (CNN) with linear-chain CRF
  • Training objective: maximize sentence-level likelihood (SLL)

Figure from (Collobert & Weston, 2011)
slide courtesy: Matt Gormley


Using GMs as Prediction Explanations

[Figure: example patient context, e.g., "+ family history of diabetes", "+ no previous heart attacks"]

• Idea: Use deep neural nets to generate parameters of a graphical model for a given context (e.g., specific instance or case)
• Produced GMs are used to make the final prediction
• GMs are built on top of interpretable variables (not deep embeddings!) and can be used as contextual explanations for each prediction

Using GMs as Prediction Explanations

[Figure: context encoder with attention]

A practical implementation:
• Maintain a (sparse) dictionary of GM parameters
• Process complex inputs (images, text, time series, etc.) using deep nets; use soft attention to either select or combine models from the dictionary
• Use constructed GMs (e.g., CRFs) to make predictions
• Inspect GMs to understand the reasoning behind predictions
Outline

• Bayesian Learning of NNs
  • Bayesian learning of NN parameters
  • Deep kernel learning


Bayesian learning of NNs

• A neural network as a probabilistic model:
  • Likelihood: $p(y \mid x, \theta)$
    • Categorical distribution for classification ⇒ cross-entropy loss
    • Gaussian distribution for regression ⇒ squared loss
  • Prior on parameters: $p(\theta)$

• Maximum a posteriori (MAP) solution:
  • $\theta^{MAP} = \arg\max_\theta \log p(y \mid x, \theta)\, p(\theta)$
  • Gaussian prior ⇒ L2 regularization
  • Laplace prior ⇒ L1 regularization

• Bayesian learning [MacKay 1992, Neal 1996, de Freitas 2003]
  • Posterior: $p(\theta \mid x, y)$
  • Variational inference with approximate posterior $q(\theta)$
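The link between a Gaussian prior and L2 regularization can be made concrete in one dimension. This sketch assumes the model $y = \theta x + \text{noise}$ with noise standard deviation `sigma` and prior $\theta \sim N(0, \tau^2)$; the data values are made up.

```python
def neg_log_posterior(theta, xs, ys, sigma=1.0, tau=2.0):
    """-log p(y | x, theta) - log p(theta), dropping additive constants.

    The Gaussian likelihood gives the squared loss; the Gaussian prior
    N(0, tau^2) adds an L2 penalty with weight 1 / (2 * tau^2).
    """
    squared_loss = sum((y - theta * x) ** 2 for x, y in zip(xs, ys)) / (2 * sigma ** 2)
    l2_penalty = theta ** 2 / (2 * tau ** 2)
    return squared_loss + l2_penalty

def map_estimate(xs, ys, sigma=1.0, tau=2.0):
    """Closed-form minimizer (ridge regression in one dimension)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + sigma ** 2 / tau ** 2)

xs = [0.0, 1.0, 2.0, 3.0]      # hypothetical data
ys = [0.1, 0.9, 2.2, 2.8]
theta_map = map_estimate(xs, ys)
theta_mle = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
```

The prior shrinks the estimate toward zero relative to maximum likelihood, and a smaller `tau` (a tighter prior) corresponds to a stronger L2 penalty.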
Bayesian learning of NNs

Variational inference (in a nutshell):

$$\min_q F(D, \theta) = KL\big(q(\theta)\,\|\,p(\theta)\big) - E_{q(\theta)}[\log p(D \mid \theta)]$$

$$\min_q F(D, \theta) \approx KL\big(q(\theta)\,\|\,p(\theta)\big) - \frac{1}{N} \sum_{i=1}^{N} \log p(D \mid \theta_i)$$

where $\theta_i \sim q(\theta)$; the KL term can be approximated similarly. Minimizing $F$ is equivalent to minimizing $KL\big(q(\theta)\,\|\,p(\theta \mid D)\big)$, since the two differ only by the constant $\log p(D)$.

• We can define $q(\theta)$ as a diagonal Gaussian or full-covariance Gaussian
• Alternatively, $q(\theta)$ can be defined implicitly, e.g. via dropout [Gal & Ghahramani, 2016]

$$\theta = M \cdot \mathrm{diag}(z), \quad z \sim \mathrm{Bernoulli}(p)$$

• Dropping out neurons is equivalent to zeroing out columns of the parameter matrices (i.e., weights)
• $z_i = 0$ corresponds to the $i$-th column of $M$ being dropped out ⇒ the procedure is equivalent to dropout of unit $i$ [Hinton et al., 2012]
• Variational parameters are $\{M, p\}$
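The equivalence between zeroing columns of $M$ and dropping units can be checked directly. In this sketch `keep_prob` plays the role of the Bernoulli parameter $p$ (treating $p$ as the probability of keeping a unit is an assumption here), and the matrix and input are made-up values.

```python
import random

def matvec(M, x):
    return [sum(mij * xj for mij, xj in zip(row, x)) for row in M]

def sample_mask(n, keep_prob, rng):
    """z_i ~ Bernoulli(keep_prob); z_i = 0 drops unit i."""
    return [1 if rng.random() < keep_prob else 0 for _ in range(n)]

def masked_weights(M, z):
    """theta = M . diag(z): column i of M is zeroed whenever z_i = 0."""
    return [[mij * zi for mij, zi in zip(row, z)] for row in M]

rng = random.Random(0)
M = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]   # hypothetical variational parameter M
x = [1.0, -1.0, 0.5]
z = sample_mask(len(x), keep_prob=0.5, rng=rng)

theta = masked_weights(M, z)
out_weight_view = matvec(theta, x)                               # zero columns of M
out_unit_view = matvec(M, [xi * zi for xi, zi in zip(x, z)])     # drop input units
```

Each sampled mask `z` yields one draw of $\theta$ from the implicit $q(\theta)$, so repeating the forward pass with fresh masks gives Monte Carlo samples for the expectation in $F$.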
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“Infinitely Wide” Deep Models 


We have seen that an infinitely deep” network can be explained by a proper GM, 
How about an “infinitely wide” one? 


Consider a neural network with a Gaussian prior on its weights an infinitely many hidden 


neurons in the intermediate layer. 
ah 


a 


\EESS< OTS J 










Turns out, if we have a certain Gaussian prior on the 
weights of such infinite network, it will be equivalent 
to a Gaussian process [Neal 1996]. 


[Figure: a network with infinitely many hidden units]





A Gaussian process (GP) is a distribution over functions:
$m(x) = E[f(x)]$,
$k(x, x') = E[(f(x) - m(x))(f(x') - m(x'))]$,
$f(x) \sim GP(m(x), k(x, x'))$.

When used for prediction, GPs account for correlations between the data points and can output well-calibrated predictive uncertainty estimates.
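
A small GP regression example shows how the predictive uncertainty behaves. The sketch assumes a zero mean function $m(x) = 0$, an RBF kernel, and a tiny noise term; the training points and the helper name `rbf` are illustrative.

```python
import numpy as np

def rbf(a, b, ell=1.0):
    """RBF kernel k(x, x') = exp(-(x - x')^2 / (2 ell^2)) for 1-D inputs."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

X = np.array([-2.0, -1.0, 0.0, 1.0])   # training inputs
y = np.sin(X)                          # training targets
Xs = np.array([0.0, 5.0])              # one test point at a training input, one far away
noise = 1e-4                           # assumed observation noise variance

K = rbf(X, X) + noise * np.eye(len(X))
Ks = rbf(Xs, X)
Kss = rbf(Xs, Xs)

# Standard GP posterior: mean and covariance at the test points
mean = Ks @ np.linalg.solve(K, y)
cov = Kss - Ks @ np.linalg.solve(K, Ks.T)
std = np.sqrt(np.clip(np.diag(cov), 0.0, None))
```

The predictive standard deviation is close to zero at the observed input and close to the prior value of 1 far from the data, which is the behaviour the slide refers to as uncertainty estimates.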
Gaussian Process and Deep Kernel Learning 


¢ Consider a neural network with a Gaussian prior on its weights and infinitely many hidden neurons in the intermediate layer.
[Figure: a network with infinitely many hidden units]

¢ Certain classes of Gaussian priors for neural networks with infinitely many hidden units converge to Gaussian processes [Neal 1996]

¢ Deep kernel [Wilson et al., 2016]
¢ Combines the inductive biases of deep model architectures with the non-parametric flexibility of Gaussian processes
¢ Starting from a base kernel $k(x_i, x_j | \phi)$, transform the inputs $x$ as
$k(x_i, x_j | \phi) \to k(g(x_i, \theta), g(x_j, \theta) | \phi, \theta)$
$p(y | f) = N(y | f, \beta^{-1})$
$p(f | \phi) = N(f | m(x), K)$, where $K_{ij} = k(x_i, x_j)$
¢ Learn both kernel and neural parameters $\{\phi, \theta\}$ jointly by optimizing the marginal log-likelihood (or its variational lower bound).

¢ Fast learning and inference with local kernel interpolation, structured inducing points, and Monte Carlo approximations


Gaussian Process and Deep Kernel Learning 


¢ By adding GP as a layer to a deep neural net, we can think of it as adding 
an infinite hidden layer with a particular prior on the weights 





¢ Deep kernel learning [Wilson et al., 2016]
¢ Combines the inductive biases of deep models with the non-parametric flexibility of Gaussian processes
¢ GPs add powerful regularization to the network and provide uncertainty estimates
[Figure: input layer, hidden layers, an infinite-width ($\infty$) GP layer, and output layer]
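
A deep kernel can be sketched by composing a feature extractor with a base kernel. In the example below the two-layer network `g`, its random weights, and the RBF base kernel are all hypothetical stand-ins; nothing is trained, the code only builds the kernel matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 8))   # hypothetical network parameters (theta)
W2 = rng.normal(size=(8, 3))

def g(x):
    """Feature extractor g(x, theta): a tiny two-layer MLP (illustrative)."""
    return np.tanh(x @ W1) @ W2

def base_kernel(a, b, ell=1.0):
    """RBF base kernel; the length-scale ell plays the role of phi."""
    sq = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2 * a @ b.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / ell**2)

def deep_kernel(xa, xb, ell=1.0):
    # k(g(x_i, theta), g(x_j, theta) | phi, theta)
    return base_kernel(g(xa), g(xb), ell)

X = rng.normal(size=(6, 2))
K = deep_kernel(X, X)
eigs = np.linalg.eigvalsh(K)   # eigenvalues of the kernel matrix
```

Because the base kernel is positive semi-definite and `g` is a deterministic map, the composed kernel matrix stays symmetric and positive semi-definite, so it can be used as a GP covariance.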
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Buvsum 
Deep kernel learning on sequential data 


What if we have data of a sequential nature?

Can we still apply the same reasoning and build rich nonparametric models on top of recurrent nets?
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Buvsum 
Deep kernel learning on sequential data 








The answer is YES!
By adding a GP layer to a recurrent network, we effectively correlate samples across time and get predictions along with well-calibrated uncertainty estimates.

Training such a model with stochastic techniques, however, requires some additional care (see our paper).
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Deep kernel learning on sequential data 


Lane prediction: LSTM vs GP-LSTM 
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Deep kernel learning on sequential data 


Lead vehicle prediction: LSTM vs GP-LSTM 
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Bunun 
Conclusion 


DL & GM: the fields are similar in the beginning (structure, energy, etc.), and then diverge to their own signature pipelines
¢ DL: most effort is directed to comparing different architectures and their components (models are driven by evaluating empirical performance on downstream tasks)
¢ DL models are good at learning robust hierarchical representations from the data and suitable for simple reasoning (call it “low-level cognition”)

¢ GM: the effort is directed towards improving inference accuracy and convergence speed
¢ GMs are best for provably correct inference and suitable for high-level complex reasoning tasks (call it “high-level cognition”)
¢ Convergence of both fields is very promising!
¢ Next part: a unified view of deep generative models in the GM interpretation
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Plan 


¢ Statistical And Algorithmic Foundation and Insight of Deep Learning


¢ On Unified Framework of Deep Generative Models 


¢ Computational Mechanisms: Distributed Deep Learning 
Architectures 
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Bunun 
Outline 


¢ Overview of advances in deep generative models 


¢ Backgrounds of deep generative models 
¢ Wake sleep algorithm
¢ Variational autoencoders 
¢ Generative adversarial networks 
¢ A unified view of deep generative models 
¢ new formulations of deep generative models 
¢ Symmetric modeling of latent and visible variables 
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Bunsum 
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Deep generative models 


¢ Define probabilistic distributions over a set of variables 
¢ ‘Deep’ means multiple layers of hidden variables! 


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Early forms of deep generative models 


¢ Hierarchical Bayesian models 
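¢ Sigmoid belief nets [Neal 1992]
[Figure: binary hidden layers $z^{(2)}, z^{(1)} \in \{0, 1\}$ above a binary visible layer]
$p(x_{kn} = 1 | \theta_k, z_n^{(1)}) = \sigma(\theta_k^T z_n^{(1)})$
$p(z_{in}^{(1)} = 1 | \theta_i, z_n^{(2)}) = \sigma(\theta_i^T z_n^{(2)})$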
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Early forms of deep generative models 


¢ Hierarchical Bayesian models
¢ Sigmoid belief nets [Neal 1992]

¢ Neural network models
¢ Helmholtz machines [Dayan et al., 1995]
[Figure: Helmholtz machine with layers, generative biases, and inference weights; Dayan et al., 1995]
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Bown 
Early forms of deep generative models 
¢ Hierarchical Bayesian models
¢ Sigmoid belief nets [Neal 1992]

¢ Neural network models
¢ Helmholtz machines [Dayan et al., 1995]
¢ Predictability minimization [Schmidhuber, 1995]

Figure courtesy: ieteriitaive 1996 
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Early forms of deep generative models 


¢ Training of DGMs via an EM-style framework

¢ Sampling / data augmentation
$z = \{z_1, z_2\}$
$z_1^{new} \sim p(z_1 | z_2, x)$
$z_2^{new} \sim p(z_2 | z_1^{new}, x)$
¢ Variational inference
$\log p(x) \ge E_{q_\phi(z|x)}[\log p_\theta(x|z)] - KL(q_\phi(z|x) \,\|\, p(z)) := L(\theta, \phi; x)$
$\max_{\theta, \phi} L(\theta, \phi; x)$
¢ Wake sleep
Wake: $\max_\theta E_{q_\phi(z|x)}[\log p_\theta(x|z)]$
Sleep: $\max_\phi E_{p_\theta(x|z)}[\log q_\phi(z|x)]$
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Buvuyum 
Resurgence of deep generative models 


¢ Restricted Boltzmann machines (RBMs) [Smolensky, 1986]
¢ Building blocks of deep probabilistic models
[Figure: RBM with a hidden layer and a visible layer; each hidden-visible edge carries a factor $\exp(v_i w_{ij} h_j)$]
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Buvuyum 
Resurgence of deep generative models 


¢ Restricted Boltzmann machines (RBMs) [Smolensky, 1986]
¢ Building blocks of deep probabilistic models
¢ Deep belief networks (DBNs) [Hinton et al., 2006]
¢ Hybrid graphical model
¢ Inference in DBNs is problematic due to explaining away
¢ Deep Boltzmann Machines (DBMs) [Salakhutdinov & Hinton, 2009]
¢ Undirected model
[Figure: Deep Belief Network vs. Deep Boltzmann Machine]
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Buvuyum 
Resurgence of deep generative models 


¢ Variational autoencoders (VAEs) [Kingma & Welling, 2014] / Neural Variational Inference and Learning (NVIL) [Mnih & Gregor, 2014]

[Figure: inference model $q_\phi(z|x)$ and generative model $p_\theta(x|z)$. Figure courtesy: Kingma & Welling, 2014]

Resurgence of deep generative models 
¢ Variational autoencoders (VAEs) [Kingma & Welling, 2014] / Neural Variational Inference and Learning (NVIL) [Mnih & Gregor, 2014]
¢ Generative adversarial networks (GANs)
[Figure: $G_\theta$: generative model mapping a code to data/generation; $D_\phi$: discriminator]
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Buvuyum 
Resurgence of deep generative models 


¢ Variational autoencoders (VAEs) [Kingma & Welling, 2014] / Neural Variational Inference and Learning (NVIL) [Mnih & Gregor, 2014]
¢ Generative adversarial networks (GANs)
¢ Generative moment matching networks (GMMNs) [Li et al., 2015; Dziugaite et al., 2015]
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Buvuyum 
Resurgence of deep generative models 


¢ Variational autoencoders (VAEs) [Kingma & Welling, 2014] / Neural Variational Inference and Learning (NVIL) [Mnih & Gregor, 2014]
¢ Generative adversarial networks (GANs)
¢ Generative moment matching networks (GMMNs) [Li et al., 2015; Dziugaite et al., 2015]

¢ Autoregressive neural networks
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Bunun 
Outline 


¢ Overview of advances in deep generative models 


¢ Backgrounds of deep generative models 
¢ Wake sleep algorithm 
¢ Variational autoencoders 
¢ Generative adversarial networks 


¢ A unified view of deep generative models 
¢ new formulations of deep generative models 
¢ Symmetric modeling of latent and visible variables 
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synonyms in the literature 


¢ Posterior Distribution -> Inference model 
¢ Variational approximation 
¢ Recognition model 
¢ Inference network (if parameterized as neural networks) 
¢ Recognition network (if parameterized as neural networks) 
¢ (Probabilistic) encoder 


¢ “The Model” (prior + conditional, or joint) -> Generative model
¢ The (data) likelihood model
¢ Generative network (if parameterized as neural networks) 
¢ Generator 
¢ (Probabilistic) decoder 
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Burwun 
Recap: Variational Inference 


¢ Consider a generative model $p_\theta(x|z)$, and prior $p(z)$
¢ Joint distribution: $p_\theta(x, z) = p_\theta(x|z)\, p(z)$
¢ Assume variational distribution $q_\phi(z|x)$

¢ Objective: Maximize lower bound for log likelihood

$\log p(x) = KL(q_\phi(z|x) \,\|\, p_\theta(z|x)) + \int q_\phi(z|x) \log \frac{p_\theta(x, z)}{q_\phi(z|x)}\, dz$
$\ge \int q_\phi(z|x) \log \frac{p_\theta(x, z)}{q_\phi(z|x)}\, dz := L(\theta, \phi; x)$
¢ Equivalently, minimize free energy

$F(\theta, \phi; x) = -\log p(x) + KL(q_\phi(z|x) \,\|\, p_\theta(z|x))$
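
The bound can be checked on a model where everything is available in closed form. The sketch below assumes a toy model with $z \sim N(0, 1)$ and $x|z \sim N(z, 1)$, so that $p(x) = N(0, 2)$ and the exact posterior is $N(x/2, 1/2)$; the function names are illustrative.

```python
import numpy as np

def log_evidence(x):
    # z ~ N(0, 1), x|z ~ N(z, 1)  =>  x ~ N(0, 2)
    return -0.5 * np.log(2 * np.pi * 2.0) - x**2 / 4.0

def elbo(x, m, s):
    # Variational distribution q(z|x) = N(m, s^2)
    exp_loglik = -0.5 * np.log(2 * np.pi) - 0.5 * ((x - m) ** 2 + s**2)
    kl_to_prior = 0.5 * (s**2 + m**2 - 1.0 - np.log(s**2))
    return exp_loglik - kl_to_prior

x = 1.3
tight = elbo(x, m=x / 2, s=np.sqrt(0.5))   # q equals the exact posterior
loose = elbo(x, m=0.0, s=1.0)              # q equals the prior
```

When $q$ equals the exact posterior the KL gap vanishes and the bound matches $\log p(x)$; any other $q$ gives a strictly smaller value, which is the free-energy statement above.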
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Recap: Variational Inference 


Maximize the variational lower bound $L(\theta, \phi; x)$
¢ E-step: maximize $L$ wrt. $\phi$ with $\theta$ fixed

$\max_\phi L(\theta, \phi; x) = E_{q_\phi(z|x)}[\log p_\theta(x|z)] - KL(q_\phi(z|x) \,\|\, p(z))$
¢ If with closed form solutions

$q_\phi^*(z|x) \propto \exp[\log p_\theta(x, z)]$
¢ M-step: maximize $L$ wrt. $\theta$ with $\phi$ fixed

$\max_\theta L(\theta, \phi; x) = E_{q_\phi(z|x)}[\log p_\theta(x|z)] - KL(q_\phi(z|x) \,\|\, p(z))$
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Burwun 
Recap: Amortized Variational Inference 


¢ Variational distribution as an inference model $q_\phi(z|x)$ with parameters $\phi$

¢ Amortize the cost of inference by learning a single data-dependent inference model

¢ The trained inference model can be used for quick inference on new data

¢ Maximize the variational lower bound $L(\theta, \phi; x)$
¢ E-step: maximize $L$ wrt. $\phi$ with $\theta$ fixed
¢ M-step: maximize $L$ wrt. $\theta$ with $\phi$ fixed
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Deep generative models with amortized inference 


¢ Helmholtz machines 


¢ Variational autoencoders (VAEs) / Neural Variational Inference 
and Learning (NVIL) 


¢ We will see later that adversarial approaches are also included in the list
¢ Predictability minimization (PM)
¢ Generative adversarial networks (GANs)
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Wake Sleep Algorithm 


¢ [Hinton et al., Science 1995] 
¢ Train a separate inference model along with the generative model 
¢ Generally applicable to a wide range of generative models, e.g., Helmholtz machines 
¢ Consider a generative model $p_\theta(x|z)$ and prior $p(z)$
¢ Joint distribution $p_\theta(x, z) = p_\theta(x|z)\, p(z)$
¢ E.g., multi-layer belief nets
¢ Inference model $q_\phi(z|x)$

¢ Maximize data log-likelihood with two steps of loss relaxation:
¢ Maximize the lower bound of log-likelihood, or equivalently, minimize the free energy
$F(\theta, \phi; x) = -\log p(x) + KL(q_\phi(z|x) \,\|\, p_\theta(z|x))$
¢ Minimize a different objective (reversed KLD) wrt. $\phi$ to ease the optimization
¢ Disconnected from the original variational lower bound loss
$F'(\theta, \phi; x) = -\log p(x) + KL(p_\theta(z|x) \,\|\, q_\phi(z|x))$
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Wake Sleep Algorithm 





¢ Free energy:
$F(\theta, \phi; x) = -\log p(x) + KL(q_\phi(z|x) \,\|\, p_\theta(z|x))$
¢ Minimize the free energy wrt. $\theta$ of $p_\theta$ → wake phase
$\max_\theta E_{q_\phi(z|x)}[\log p_\theta(x, z)]$
¢ Get samples from $q_\phi(z|x)$ through inference on hidden variables
¢ Use the samples as targets for updating the generative model $p_\theta(x|z)$
¢ Corresponds to the variational M step

[Figure courtesy: Maei's slides]


Wake Sleep Algorithm 


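¢ Free energy:
$F(\theta, \phi; x) = -\log p(x) + KL(q_\phi(z|x) \,\|\, p_\theta(z|x))$
¢ Minimize the free energy wrt. $\phi$ of $q_\phi(z|x)$
¢ Corresponds to the variational E step
¢ Difficulties:
¢ Optimal $q_\phi^*(z|x) = \frac{p_\theta(z, x)}{\int p_\theta(z, x)\, dz}$ is intractable
¢ High variance of direct gradient estimate $\nabla_\phi F(\theta, \phi; x) = \cdots + \nabla_\phi E_{q_\phi(z|x)}[\log p_\theta(z, x)] + \cdots$
¢ Gradient estimate with the log-derivative trick:
$\nabla_\phi E_{q_\phi}[\log p_\theta] = \int \nabla_\phi q_\phi \log p_\theta = \int q_\phi \log p_\theta \nabla_\phi \log q_\phi = E_{q_\phi}[\log p_\theta \nabla_\phi \log q_\phi]$
¢ Monte Carlo estimation:
$\nabla_\phi E_{q_\phi}[\log p_\theta] \approx E_{z_i \sim q_\phi}[\log p_\theta(x, z_i) \nabla_\phi \log q_\phi(z_i|x)]$
¢ The scale factor $\log p_\theta$ of the derivative $\nabla_\phi \log q_\phi$ can have arbitrarily large magnitude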
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Wake Sleep Algorithm 





¢ Free energy:
$F(\theta, \phi; x) = -\log p(x) + KL(q_\phi(z|x) \,\|\, p_\theta(z|x))$
¢ WS works around the difficulties with the sleep phase approximation
¢ Minimize the following objective → sleep phase
$F'(\theta, \phi; x) = -\log p(x) + KL(p_\theta(z|x) \,\|\, q_\phi(z|x))$
$\max_\phi E_{p_\theta(z, x)}[\log q_\phi(z|x)]$

¢ “Dreaming” up samples from $p_\theta(x|z)$ through top-down pass
¢ Use the samples as targets for updating the inference model

¢ (Recent approaches reduce the variance of the gradient estimate instead of using a sleep phase: slides later)
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Wake Sleep Algorithm 


Wake sleep
¢ Parametrized inference model $q_\phi(z|x)$
¢ Wake phase:
¢ minimize $KL(q_\phi(z|x) \,\|\, p_\theta(z|x))$ wrt. $\theta$
¢ $E_{q_\phi(z|x)}[\nabla_\theta \log p_\theta(x|z)]$
¢ Sleep phase:
¢ minimize $KL(p_\theta(z|x) \,\|\, q_\phi(z|x))$ wrt. $\phi$
¢ $E_{p_\theta(z, x)}[\nabla_\phi \log q_\phi(z|x)]$
¢ low variance
¢ Learning with generated samples of $x$
¢ Two objectives, not guaranteed to converge

Variational EM
¢ Variational distribution $q_\phi(z|x)$
¢ Variational M step:
¢ minimize $KL(q_\phi(z|x) \,\|\, p_\theta(z|x))$ wrt. $\theta$
¢ $E_{q_\phi(z|x)}[\nabla_\theta \log p_\theta(x|z)]$
¢ Variational E step:
¢ minimize $KL(q_\phi(z|x) \,\|\, p_\theta(z|x))$ wrt. $\phi$
¢ $q_\phi^* \propto \exp[\log p_\theta]$ if with closed-form
¢ $\nabla_\phi E_{q_\phi}[\log p_\theta(z, x)]$
¢ need variance reduction in practice
¢ Learning with real data $x$
¢ Single objective, guaranteed to converge
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Variational Autoencoders (VAEs} 


¢ [Kingma & Welling, 2014]

¢ Use variational inference with an inference model
¢ Enjoy similar applicability to the wake-sleep algorithm

¢ Generative model $p_\theta(x|z)$, and prior $p(z)$
¢ Joint distribution $p_\theta(x, z) = p_\theta(x|z)\, p(z)$
¢ Inference model $q_\phi(z|x)$

[Figure: inference model $q_\phi(z|x)$ and generative model $p_\theta(x|z)$. Figure courtesy: Kingma & Welling, 2014]
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Variational Autoencoders (VAEs} 


¢ Variational lower bound
$L(\theta, \phi; x) = E_{q_\phi(z|x)}[\log p_\theta(x|z)] - KL(q_\phi(z|x) \,\|\, p(z))$

¢ Optimize $L(\theta, \phi; x)$ wrt. $\theta$ of $p_\theta(x|z)$
¢ The same as the wake phase

¢ Optimize $L(\theta, \phi; x)$ wrt. $\phi$ of $q_\phi(z|x)$

$\nabla_\phi L(\theta, \phi; x) = \cdots + \nabla_\phi E_{q_\phi(z|x)}[\log p_\theta(x|z)] + \cdots$

¢ Use the reparameterization trick to reduce variance

¢ Alternatives: use control variates as in reinforcement learning [Mnih & Gregor, 2014; Paisley et al., 2012]
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Reparametrized gradient 


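¢ Optimize $L(\theta, \phi; x)$ wrt. $\phi$ of $q_\phi(z|x)$
¢ Recap: gradient estimate with log-derivative trick:
$\nabla_\phi E_{q_\phi}[\log p_\theta(x, z)] = E_{q_\phi}[\log p_\theta(x, z) \nabla_\phi \log q_\phi]$
¢ High variance: $\nabla_\phi E_{q_\phi}[\log p_\theta] \approx E_{z_i \sim q_\phi}[\log p_\theta(x, z_i) \nabla_\phi \log q_\phi(z_i|x)]$
¢ The scale factor $\log p_\theta(x, z_i)$ of the derivative $\nabla_\phi \log q_\phi$ can have arbitrarily large magnitude
¢ Gradient estimate with reparameterization trick
$z \sim q_\phi(z|x) \Leftrightarrow z = g_\phi(\epsilon, x),\ \epsilon \sim p(\epsilon)$
$\nabla_\phi E_{q_\phi(z|x)}[\log p_\theta(x, z)] = E_{\epsilon \sim p(\epsilon)}[\nabla_\phi \log p_\theta(x, g_\phi(\epsilon, x))]$

¢ (Empirically) lower variance of the gradient estimate
¢ E.g., $z \sim N(\mu(x), L(x)L(x)^T) \Leftrightarrow \epsilon \sim N(0, I),\ z = \mu(x) + L(x)\epsilon$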
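
The variance gap between the two estimators can be illustrated on a one-dimensional problem. The sketch assumes $q = N(\mu, 1)$ and uses $f(z) = z^2$ as a stand-in for $\log p_\theta(x, z)$, so the true gradient of $E_q[f(z)]$ with respect to $\mu$ is $2\mu$; all names and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, n = 1.0, 200_000
eps = rng.normal(size=n)           # eps ~ N(0, 1)
z = mu + eps                       # reparameterization: z = g(eps) = mu + eps

f = lambda z: z**2                 # stand-in for log p_theta(x, z)

# Log-derivative (score function) estimator: f(z) * d/dmu log N(z; mu, 1)
score_samples = f(z) * (z - mu)
# Reparameterized estimator: d/dmu f(mu + eps) = f'(z)
reparam_samples = 2.0 * z

true_grad = 2.0 * mu               # d/dmu E[z^2] = d/dmu (mu^2 + 1)
```

Both sample means approximate the same gradient, but the per-sample variance of the reparameterized estimator is several times smaller in this toy setting. This is an empirical observation for this example, not a general guarantee.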
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VAEs: algorithm 


Algorithm 1 Minibatch version of the Auto-Encoding VB (AEVB) algorithm. Either of the two SGVB estimators in section 2.3 can be used. We use settings M = 100 and L = 1 in experiments.

θ, φ ← Initialize parameters

repeat
  X^M ← Random minibatch of M datapoints (drawn from full dataset)
  ε ← Random samples from noise distribution p(ε)
  g ← ∇_{θ,φ} L̃^M(θ, φ; X^M, ε) (Gradients of minibatch estimator (8))
  θ, φ ← Update parameters using gradients g (e.g. SGD or Adagrad [DHS10])
until convergence of parameters (θ, φ)
return θ, φ


[Kingma & Welling, 2014] 
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Bursun 
VAES: example results 


VAEs tend to generate blurred 
images due to the mode covering 
behavior (more later) 





Celebrity faces [Radford 2015] 


Latent code interpolation and 
sentences generation from VAEs 
[Bowman et al., 2015]. 


“ i want to talk to you . ”
“ i want to be with you . ”
“ i do n’t want to be with you . ”
i do n’t want to be with you .
she did n’t want to be with him .
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Generative Adversarial Nets (GANs) 


¢ [Goodfellow et al., 2014] 
¢ Generative model $x = G_\theta(z)$, $z \sim p(z)$
¢ Map noise variable $z$ to data space $x$
¢ Define an implicit distribution over $x$: $p_{g_\theta}(x)$
¢ A stochastic process to simulate data $x$
¢ Intractable to evaluate likelihood
¢ Discriminator $D_\phi(x)$
¢ Output the probability that $x$ came from the data rather than the generator
¢ No explicit inference model

¢ No obvious connection to previous models with inference networks like VAEs
¢ We will build formal connections between GANs and VAEs later



Generative Adversarial Nets (GANs) 


¢ Learning 


¢ A minimax game between the generator and the discriminator

¢ Train D to maximize the probability of assigning the correct label to both training examples and generated samples

¢ Train G to fool the discriminator
$\max_D L_D = E_{x \sim p_{data}(x)}[\log D(x)] + E_{x \sim G(z), z \sim p(z)}[\log(1 - D(x))]$
$\min_G L_G = E_{x \sim G(z), z \sim p(z)}[\log(1 - D(x))]$

[Figure: discriminator training on real and fake images; generator training through the discriminator]
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Generative Adversarial Nets (GANs) 


¢ Learning 
¢ Train G to fool the discriminator
¢ The original loss suffers from vanishing gradients when D is too strong
¢ Instead use the following in practice

$\max_G L_G = E_{x \sim G(z), z \sim p(z)}[\log D(x)]$

[Figure: discriminator training on real and fake images; generator training]
[Figure courtesy: Kim's slides]
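
The vanishing-gradient issue can be seen by differentiating both generator objectives with respect to the discriminator logit. The sketch assumes $D = \sigma(a)$ for a scalar logit $a$ evaluated on a generated sample; the logit values are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Logit of D on a generated sample; very negative means a confident "fake"
a = np.array([-8.0, -4.0, 0.0])
D = sigmoid(a)

# Gradients of the two generator objectives with respect to the logit a:
grad_saturating = -D              # d/da log(1 - sigmoid(a)), original (minimized) loss
grad_non_saturating = 1.0 - D     # d/da log(sigmoid(a)), practical (maximized) loss
```

When the discriminator confidently rejects a sample ($D \approx 0$), the original loss passes back almost no gradient, while the alternative objective keeps a gradient close to 1.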
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Generative Adversarial Nets (GANs) 


¢ Learning 


¢ Aim to achieve equilibrium of the game

¢ Optimal state:

¢ $p_{g_\theta}(x) = p_{data}(x)$
¢ $D(x) = \frac{p_{data}(x)}{p_{data}(x) + p_{g_\theta}(x)} = \frac{1}{2}$

[Figure: discriminator training on real and fake images; generator training]
[Figure courtesy: Kim's slides]
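
The form of the optimal discriminator can be verified by brute force on a discrete example. The sketch assumes two hand-picked distributions over four outcomes and maximizes the pointwise discriminator objective over a grid of candidate values; all numbers are illustrative.

```python
import numpy as np

p_data = np.array([0.1, 0.2, 0.3, 0.4])
p_g = np.array([0.4, 0.3, 0.2, 0.1])

grid = np.linspace(0.001, 0.999, 999)     # candidate values of D(x)
# Pointwise discriminator objective: p_data(x) log D + p_g(x) log(1 - D)
obj = p_data[:, None] * np.log(grid) + p_g[:, None] * np.log(1.0 - grid)
d_best = grid[np.argmax(obj, axis=1)]     # grid maximizer for each x

d_star = p_data / (p_data + p_g)          # closed-form optimum
d_equilibrium = p_data / (p_data + p_data)  # value when p_g equals p_data
```

The grid maximizer matches $p_{data}/(p_{data} + p_g)$ at every outcome, and the ratio reduces to 1/2 everywhere once the generator distribution equals the data distribution.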
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GANs: example results 
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Generated bedrooms [Radford et al., 2016] 
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Bunun 
Outline 


¢ Overview of advances in deep generative models 


¢ Backgrounds of deep generative models 
¢ Wake sleep algorithm 
¢ Variational autoencoders 
¢ Generative adversarial networks 


¢ A unified view of deep generative models 
¢ new formulations of deep generative models 
¢ Symmetric modeling of latent and visible variables 


Z Hu, Z Yang, R Salakhutdinov, E Xing,
“On Unifying Deep Generative Models”, arXiv:1706.00550
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A unified view of deep generative models 


¢ The literature has viewed these DGM approaches as distinct model training paradigms
¢ GANs: achieve an equilibrium between generator and discriminator
¢ VAEs: maximize lower bound of the data likelihood

¢ Let's study a new formulation for DGMs
¢ Connects GANs, VAEs, and other variants, under a unified view

¢ Links them back to inference and learning of Graphical Models, and the wake-sleep heuristic that approximates this

¢ Provides a tool to analyze many GAN-/VAE-based algorithms

¢ Encourages mutual exchange of ideas from each individual class of models
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Adversarial domain adaptation (ADA) 


¢ Let's start from ADA 
¢ The application of adversarial approach on domain adaptation 


¢ We then show GANs can be seen as a Special case of ADA 
¢ Correspondence of elements:
GANs: $z$ is the code vector, $x$ is the data/generation, $y$ is the real/fake indicator
ADA: $z$ is the data (from two domains), $x$ is the feature, $y$ is the source/target domain indicator
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Adversarial domain adaptation (ADA) 


¢ Data $z$ from two domains indicated by $y \in \{0, 1\}$
¢ Source domain ($y = 1$)
¢ Target domain ($y = 0$)
¢ ADA transfers prediction knowledge learned from the source domain to the target domain
¢ Learn a feature extractor $G_\theta$: $x = G_\theta(z)$
¢ Wants $x$ to be indistinguishable by a domain discriminator: $D_\phi(x)$
¢ Application in classification
¢ E.g., we have labels of the source domain data
¢ Train classifier over $x$ of source domain data to predict the labels
¢ $x$ is domain invariant → $x$ is predictive for target domain
[Figure: data $z$ mapped to feature $x$]
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Busum 
ADA: conventional formulation 


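¢ Train $D_\phi$ to distinguish between domains
$\max_\phi L_\phi = E_{x = G_\theta(z), z \sim p(z|y=1)}[\log D_\phi(x)] + E_{x = G_\theta(z), z \sim p(z|y=0)}[\log(1 - D_\phi(x))]$
¢ Train $G_\theta$ to fool $D_\phi$

$\max_\theta L_\theta = E_{x = G_\theta(z), z \sim p(z|y=1)}[\log(1 - D_\phi(x))] + E_{x = G_\theta(z), z \sim p(z|y=0)}[\log D_\phi(x)]$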
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Bunyuy 
ADA: new formulation 


¢ To reveal the connections to conventional variational approaches, let's rewrite the objectives in a format that resembles variational EM

¢ Implicit distribution over $x \sim p_\theta(x|y)$
$x = G_\theta(z)$, $z \sim p(z|y)$
¢ Discriminator distribution $q_\phi(y|x)$
$q_\phi^r(y|x) = q_\phi(1 - y|x)$
¢ Rewrite the objective in the new form (up to constant scale factor)
$\max_\phi L_\phi = E_{p_\theta(x|y)p(y)}[\log q_\phi(y|x)]$
$\max_\theta L_\theta = E_{p_\theta(x|y)p(y)}[\log q_\phi^r(y|x)]$
¢ $z$ is encapsulated in the implicit distribution $p_\theta(x|y)$
$\max_\phi L_\phi = E_{p_\theta(x|y=0)p(y=0)}[\log q_\phi(y=0|x)] + E_{p_\theta(x|y=1)p(y=1)}[\log q_\phi(y=1|x)]$
$= \frac{1}{2} E_{x = G_\theta(z), z \sim p(z|y=0)}[\log(1 - D_\phi(x))] + \frac{1}{2} E_{x = G_\theta(z), z \sim p(z|y=1)}[\log D_\phi(x)]$
¢ (Ignore the constant scale factor 1/2)

ADA: new formulation 


¢ New formulation 


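$\max_\phi L_\phi = E_{p_\theta(x|y)p(y)}[\log q_\phi(y|x)]$
$\max_\theta L_\theta = E_{p_\theta(x|y)p(y)}[\log q_\phi^r(y|x)]$, where $q_\phi^r(y|x) = q_\phi(1 - y|x)$

¢ The only difference between the objectives of $\theta$ and $\phi$: $q$ vs. $q^r$
¢ This is where the adversarial mechanism comes about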
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ADA vs. Variational EM 


Variational EM
¢ Objectives
$\max_\phi L_{\theta,\phi} = E_{q_\phi(z|x)}[\log p_\theta(x|z)] - KL(q_\phi(z|x) \,\|\, p(z))$
$\max_\theta L_{\theta,\phi} = E_{q_\phi(z|x)}[\log p_\theta(x|z)] - KL(q_\phi(z|x) \,\|\, p(z))$
¢ Single objective for both $\theta$ and $\phi$
¢ Extra prior regularization by $p(z)$

ADA
¢ Objectives
$\max_\phi L_\phi = E_{p_\theta(x|y)p(y)}[\log q_\phi(y|x)]$
$\max_\theta L_\theta = E_{p_\theta(x|y)p(y)}[\log q_\phi^r(y|x)]$
¢ Two objectives
¢ Have global optimal state in the game theoretic view



ADA vs. Variational EM 


Variational EM
¢ Objectives
$\max_\phi L_{\theta,\phi} = E_{q_\phi(z|x)}[\log p_\theta(x|z)] - KL(q_\phi(z|x) \,\|\, p(z))$
$\max_\theta L_{\theta,\phi} = E_{q_\phi(z|x)}[\log p_\theta(x|z)] - KL(q_\phi(z|x) \,\|\, p(z))$
¢ Single objective for both $\theta$ and $\phi$
¢ Extra prior regularization by $p(z)$
¢ The reconstruction term: maximize the conditional log-likelihood of $x$ with the generative distribution $p_\theta(x|z)$ conditioning on the latent code $z$ inferred by $q_\phi(z|x)$
¢ $p_\theta(x|z)$ is the generative model
¢ $q_\phi(z|x)$ is the inference model

ADA
¢ Objectives
$\max_\phi L_\phi = E_{p_\theta(x|y)p(y)}[\log q_\phi(y|x)]$
$\max_\theta L_\theta = E_{p_\theta(x|y)p(y)}[\log q_\phi^r(y|x)]$
¢ Two objectives
¢ Have global optimal state in the game theoretic view
¢ The objectives: maximize the conditional log-likelihood of $y$ (or $1 - y$) with the distribution $q_\phi(y|x)$ conditioning on latent feature $x$ inferred by $p_\theta(x|y)$
¢ Interpret $q_\phi(y|x)$ as the generative model
¢ Interpret $p_\theta(x|y)$ as the inference model



ADA: graphical model 


Define:

¢ Solid-line arrows ($x \to y$):
¢ generative process
¢ Dashed-line arrows ($y, z \to x$):
¢ inference
¢ Hollow arrows ($z \to x$):
¢ deterministic transformation
¢ leading to implicit distributions
¢ Blue arrows ($x \to y$):
¢ adversarial mechanism
¢ involves both $q_\phi(y|x)$ and $q_\phi^r(y|x)$
[Figure: graphical model of ADA, annotated with $\max_\phi L_\phi = E_{p_\theta(x|y)p(y)}[\log q_\phi(y|x)]$ and $\max_\theta L_\theta = E_{p_\theta(x|y)p(y)}[\log q_\phi^r(y|x)]$]
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GANs: a variant of ADA 


¢ Transfer the properties of source domain to target domain 
¢ Source domain: e.g. real image, y = 1 
¢ Target domain: e.g. generated image, y = 0 





[Figure: ADA maps data to a feature; GANs map a code to data/generation]
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GANs: a variant of ADA 


¢ Implicit distribution over $x \sim p_\theta(x|y)$

$p_\theta(x|y) = p_{g_\theta}(x)$ if $y = 0$ (distribution of generated images)
$p_\theta(x|y) = p_{data}(x)$ if $y = 1$ (distribution of real images)

¢ $x \sim p_{g_\theta}(x) \Leftrightarrow x = G_\theta(z),\ z \sim p(z|y = 0)$
¢ $x \sim p_{data}(x)$

¢ the code space of $z$ is degenerated
¢ sample directly from data
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Bunun 
GANs: new formulation 


¢ Again, rewrite GAN objectives in the “variational-EM” format
¢ Recap: conventional formulation:

$\max_\phi L_\phi = E_{x = G_\theta(z), z \sim p(z|y=0)}[\log(1 - D_\phi(x))] + E_{x \sim p_{data}(x)}[\log D_\phi(x)]$
$\max_\theta L_\theta = E_{x = G_\theta(z), z \sim p(z|y=0)}[\log D_\phi(x)] + E_{x \sim p_{data}(x)}[\log(1 - D_\phi(x))]$
$= E_{x = G_\theta(z), z \sim p(z|y=0)}[\log D_\phi(x)] + \mathrm{const}$ (the second term does not depend on $\theta$)
¢ Rewrite in the new form
$\max_\phi L_\phi = E_{p_\theta(x|y)p(y)}[\log q_\phi(y|x)]$
$\max_\theta L_\theta = E_{p_\theta(x|y)p(y)}[\log q_\phi^r(y|x)]$

¢ Exactly the same as ADA!
¢ The same correspondence to variational EM!



GANs vs. Variational EM

Variational EM
¢ Objectives
$\max_\phi L_{\theta,\phi} = E_{q_\phi(z|x)}[\log p_\theta(x|z)] - KL(q_\phi(z|x) \,\|\, p(z))$
$\max_\theta L_{\theta,\phi} = E_{q_\phi(z|x)}[\log p_\theta(x|z)] - KL(q_\phi(z|x) \,\|\, p(z))$
¢ Single objective for both $\theta$ and $\phi$
¢ Extra prior regularization by $p(z)$
¢ The reconstruction term: maximize the conditional log-likelihood of $x$ with the generative distribution $p_\theta(x|z)$ conditioning on the latent code $z$ inferred by $q_\phi(z|x)$
¢ $p_\theta(x|z)$ is the generative model
¢ $q_\phi(z|x)$ is the inference model

GAN
¢ Objectives
$\max_\phi L_\phi = E_{p_\theta(x|y)p(y)}[\log q_\phi(y|x)]$
$\max_\theta L_\theta = E_{p_\theta(x|y)p(y)}[\log q_\phi^r(y|x)]$
¢ Two objectives
¢ Have global optimal state in the game theoretic view
¢ The objectives: maximize the conditional log-likelihood of $y$ (or $1 - y$) with the distribution $q_\phi(y|x)$ conditioning on data/generation $x$ inferred by $p_\theta(x|y)$
¢ Interpret $q_\phi(y|x)$ as the generative model
¢ Interpret $p_\theta(x|y)$ as the inference model



GANs vs. Variational EM

¢ Interpret $x$ as latent variables
¢ Interpret generation of $x$ as performing inference over latent variables

Variational EM
¢ Objectives
$\max_\phi L_{\theta,\phi} = E_{q_\phi(z|x)}[\log p_\theta(x|z)] - KL(q_\phi(z|x) \,\|\, p(z))$
$\max_\theta L_{\theta,\phi} = E_{q_\phi(z|x)}[\log p_\theta(x|z)] - KL(q_\phi(z|x) \,\|\, p(z))$
¢ Single objective for both $\theta$ and $\phi$
¢ Extra prior regularization by $p(z)$
¢ The reconstruction term: maximize the conditional log-likelihood of $x$ with the generative distribution $p_\theta(x|z)$ conditioning on the latent code $z$ inferred by $q_\phi(z|x)$
¢ $p_\theta(x|z)$ is the generative model
¢ $q_\phi(z|x)$ is the inference model

GAN
¢ Objectives
$\max_\phi L_\phi = E_{p_\theta(x|y)p(y)}[\log q_\phi(y|x)]$
$\max_\theta L_\theta = E_{p_\theta(x|y)p(y)}[\log q_\phi^r(y|x)]$
¢ Two objectives
¢ Have global optimal state in the game theoretic view
¢ The objectives: maximize the conditional log-likelihood of $y$ (or $1 - y$) with the distribution $q_\phi(y|x)$ conditioning on data/generation $x$ inferred by $p_\theta(x|y)$
¢ Interpret $q_\phi(y|x)$ as the generative model
¢ Interpret $p_\theta(x|y)$ as the inference model



GANs: minimizing KLD 


¢ As in Variational EM, we can further rewrite in the form of minimizing KLD to reveal more insights into the optimization problem

¢ For each optimization step of $p_\theta(x|y)$ at point $(\theta = \theta_0, \phi = \phi_0)$, let
¢ $p(y)$: uniform prior distribution
¢ $p_{\theta=\theta_0}(x) = E_{p(y)}[p_{\theta=\theta_0}(x|y)]$
¢ $q^r(x|y) \propto q^r_{\phi=\phi_0}(y|x)\, p_{\theta=\theta_0}(x)$
¢ Lemma 1: The updates of $\theta$ at $\theta_0$ have

$\nabla_\theta \left[ -E_{p_\theta(x|y)p(y)}[\log q^r_{\phi=\phi_0}(y|x)] \right] \Big|_{\theta=\theta_0} = \nabla_\theta \left[ E_{p(y)}[KL(p_\theta(x|y) \,\|\, q^r(x|y))] - JSD(p_\theta(x|y=0) \,\|\, p_\theta(x|y=1)) \right] \Big|_{\theta=\theta_0}$

¢ KL: KL divergence
¢ JSD: Jensen-Shannon divergence
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Proof of Lemma 1 


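Proof.
$E_{p_\theta(x|y)p(y)}[\log q^r(y|x)] = -E_{p(y)}[KL(p_\theta(x|y) \,\|\, q^r(x|y)) - KL(p_\theta(x|y) \,\|\, p_{\theta_0}(x))] + \mathrm{const}$,   (3)
where the constant is the log-normalizer of $q^r(x|y)$ and does not depend on $\theta$, and
$E_{p(y)}[KL(p_\theta(x|y) \,\|\, p_{\theta_0}(x))]$
$= p(y=0) \cdot KL\left(p_\theta(x|y=0) \,\Big\|\, \frac{p_{\theta_0}(x|y=0) + p_{\theta_0}(x|y=1)}{2}\right) + p(y=1) \cdot KL\left(p_\theta(x|y=1) \,\Big\|\, \frac{p_{\theta_0}(x|y=0) + p_{\theta_0}(x|y=1)}{2}\right)$.   (4)
Note that $p_\theta(x|y=0) = p_{g_\theta}(x)$, and $p_\theta(x|y=1) = p_{data}(x)$. Let $p_{M_\theta} = \frac{p_{g_\theta} + p_{data}}{2}$. Eq.(4) can be simplified as:

$E_{p(y)}[KL(p_\theta(x|y) \,\|\, p_{\theta_0}(x))] = \frac{1}{2} KL(p_{g_\theta} \,\|\, p_{M_{\theta_0}}) + \frac{1}{2} KL(p_{data} \,\|\, p_{M_{\theta_0}})$.   (5)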
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Proof of Lemma 1 (cont.) 


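On the other hand,

$JSD(p_{g_\theta} \,\|\, p_{data}) = \frac{1}{2} E_{p_{g_\theta}}\left[\log \frac{p_{g_\theta}}{p_{M_\theta}}\right] + \frac{1}{2} E_{p_{data}}\left[\log \frac{p_{data}}{p_{M_\theta}}\right]$
$= \frac{1}{2} E_{p_{g_\theta}}\left[\log \frac{p_{g_\theta}}{p_{M_{\theta_0}}}\right] + \frac{1}{2} E_{p_{g_\theta}}\left[\log \frac{p_{M_{\theta_0}}}{p_{M_\theta}}\right] + \frac{1}{2} E_{p_{data}}\left[\log \frac{p_{data}}{p_{M_{\theta_0}}}\right] + \frac{1}{2} E_{p_{data}}\left[\log \frac{p_{M_{\theta_0}}}{p_{M_\theta}}\right]$   (6)
$= \frac{1}{2} E_{p_{g_\theta}}\left[\log \frac{p_{g_\theta}}{p_{M_{\theta_0}}}\right] + \frac{1}{2} E_{p_{data}}\left[\log \frac{p_{data}}{p_{M_{\theta_0}}}\right] + E_{p_{M_\theta}}\left[\log \frac{p_{M_{\theta_0}}}{p_{M_\theta}}\right]$
$= \frac{1}{2} KL(p_{g_\theta} \,\|\, p_{M_{\theta_0}}) + \frac{1}{2} KL(p_{data} \,\|\, p_{M_{\theta_0}}) - KL(p_{M_\theta} \,\|\, p_{M_{\theta_0}})$.
Note that
$\nabla_\theta KL(p_{M_\theta} \,\|\, p_{M_{\theta_0}}) \big|_{\theta=\theta_0} = 0$.   (7)
Taking derivatives of Eq.(5) w.r.t. $\theta$ at $\theta_0$ we get
$\nabla_\theta E_{p(y)}[KL(p_\theta(x|y) \,\|\, p_{\theta_0}(x))] \big|_{\theta=\theta_0}$
$= \nabla_\theta \left( \frac{1}{2} KL(p_{g_\theta} \,\|\, p_{M_{\theta_0}}) + \frac{1}{2} KL(p_{data} \,\|\, p_{M_{\theta_0}}) \right) \Big|_{\theta=\theta_0}$   (8)
$= \nabla_\theta JSD(p_{g_\theta} \,\|\, p_{data}) \big|_{\theta=\theta_0}$.

Taking derivatives of both sides of Eq.(3) w.r.t. $\theta$ at $\theta_0$ and plugging in the last equation of Eq.(8), we obtain the desired results.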
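
Lemma 1 can be sanity-checked numerically on a discrete toy problem. The sketch assumes $x$ takes six values, a softmax-parameterized generator distribution, a fixed data distribution, and an arbitrary (non-optimal) discriminator `D` standing for $q_{\phi_0}(y=1|x)$; gradients are taken by central finite differences. All names and values are illustrative.

```python
import numpy as np

def softmax(t):
    e = np.exp(t - t.max())
    return e / e.sum()

def kl(p, q):
    return float(np.sum(p * (np.log(p) - np.log(q))))

def jsd(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def num_grad(f, t, eps=1e-5):
    """Central finite-difference gradient of a scalar function f at t."""
    g = np.zeros_like(t)
    for i in range(t.size):
        d = np.zeros_like(t)
        d[i] = eps
        g[i] = (f(t + d) - f(t - d)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
K = 6
theta0 = rng.normal(size=K)
p_data = softmax(rng.normal(size=K))
D = rng.uniform(0.1, 0.9, size=K)          # q(y=1|x); q^r(y=0|x) = D, q^r(y=1|x) = 1 - D

p_mix0 = 0.5 * (softmax(theta0) + p_data)  # p_{theta_0}(x) under a uniform p(y)
qr_x_y0 = D * p_mix0 / np.sum(D * p_mix0)              # q^r(x|y=0)
qr_x_y1 = (1 - D) * p_mix0 / np.sum((1 - D) * p_mix0)  # q^r(x|y=1)

def lhs(theta):   # -E_{p_theta(x|y)p(y)}[log q^r(y|x)]
    p_g = softmax(theta)
    return -(0.5 * np.sum(p_g * np.log(D)) + 0.5 * np.sum(p_data * np.log(1 - D)))

def rhs(theta):   # E_{p(y)}[KL(p_theta(x|y) || q^r(x|y))] - JSD
    p_g = softmax(theta)
    return 0.5 * kl(p_g, qr_x_y0) + 0.5 * kl(p_data, qr_x_y1) - jsd(p_g, p_data)

grad_lhs = num_grad(lhs, theta0)
grad_rhs = num_grad(rhs, theta0)
```

The two gradients agree at $\theta_0$ even though `D` is not the optimal discriminator, which is the point of the lemma.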


GANs: minimizing KLD 


¢ Lemma 1: The updates of $\theta$ at $\theta_0$ have

$\nabla_\theta \left[ -E_{p_\theta(x|y)p(y)}[\log q^r_{\phi=\phi_0}(y|x)] \right] \Big|_{\theta=\theta_0} = \nabla_\theta \left[ E_{p(y)}[KL(p_\theta(x|y) \,\|\, q^r(x|y))] - JSD(p_\theta(x|y=0) \,\|\, p_\theta(x|y=1)) \right] \Big|_{\theta=\theta_0}$

¢ Connection to variational inference
¢ See $x$ as latent variables, $y$ as visible
¢ $p_{\theta=\theta_0}(x)$: prior distribution
¢ $q^r(x|y) \propto q^r_{\phi=\phi_0}(y|x)\, p_{\theta=\theta_0}(x)$: posterior distribution
¢ $p_\theta(x|y)$: variational distribution
¢ Amortized inference: updates model parameter $\theta$

¢ Suggests relations to VAEs, as we will explore shortly
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GANs: minimizing KLD 


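¢ Lemma 1: The updates of $\theta$ at $\theta_0$ have

$\nabla_\theta \left[ -E_{p_\theta(x|y)p(y)}[\log q^r_{\phi=\phi_0}(y|x)] \right] \Big|_{\theta=\theta_0} = \nabla_\theta \left[ E_{p(y)}[KL(p_\theta(x|y) \,\|\, q^r(x|y))] - JSD(p_\theta(x|y=0) \,\|\, p_\theta(x|y=1)) \right] \Big|_{\theta=\theta_0}$

¢ Minimizing the KLD drives $p_{g_\theta}(x)$ to $p_{data}(x)$

¢ By definition: $p_{\theta=\theta_0}(x) = E_{p(y)}[p_{\theta=\theta_0}(x|y)] = (p_{g_{\theta=\theta_0}}(x) + p_{data}(x)) / 2$
¢ $KL(p_\theta(x|y=1) \,\|\, q^r(x|y=1)) = KL(p_{data}(x) \,\|\, q^r(x|y=1))$: constant, no free parameters
¢ $KL(p_\theta(x|y=0) \,\|\, q^r(x|y=0)) = KL(p_{g_\theta}(x) \,\|\, q^r(x|y=0))$: parameter $\theta$ to optimize
¢ $q^r(x|y=0) \propto q^r_{\phi=\phi_0}(y=0|x)\, p_{\theta=\theta_0}(x)$
¢ seen as a mixture of $p_{g_{\theta=\theta_0}}(x)$ and $p_{data}(x)$
¢ mixing weights induced from $q^r_{\phi=\phi_0}(y=0|x)$
¢ Drives $p_\theta(x|y)$ to the mixture of $p_{g_{\theta=\theta_0}}(x)$ and $p_{data}(x)$
⇒ Drives $p_{g_\theta}(x)$ to $p_{data}(x)$
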


GANs: minimizing KLD 


[Figure: densities $p_{\theta=\theta_0}(x|y=1) = p_{data}(x)$, $p_{\theta=\theta_0}(x|y=0) = p_{g_{\theta=\theta_0}}(x)$, and the updated $p_{\theta=\theta^{new}}(x|y=0) = p_{g_{\theta=\theta^{new}}}(x)$]

¢ Minimizing the KLD drives $p_{g_\theta}(x)$ to $p_{data}(x)$

¢ By definition: $p_{\theta=\theta_0}(x) = E_{p(y)}[p_{\theta=\theta_0}(x|y)] = (p_{g_{\theta=\theta_0}}(x) + p_{data}(x)) / 2$
¢ $KL(p_\theta(x|y=1) \,\|\, q^r(x|y=1)) = KL(p_{data}(x) \,\|\, q^r(x|y=1))$: constant, no free parameters
¢ $KL(p_\theta(x|y=0) \,\|\, q^r(x|y=0)) = KL(p_{g_\theta}(x) \,\|\, q^r(x|y=0))$: parameter $\theta$ to optimize
¢ $q^r(x|y=0) \propto q^r_{\phi=\phi_0}(y=0|x)\, p_{\theta=\theta_0}(x)$
¢ seen as a mixture of $p_{g_{\theta=\theta_0}}(x)$ and $p_{data}(x)$
¢ mixing weights induced from $q^r_{\phi=\phi_0}(y=0|x)$
¢ Drives $p_\theta(x|y)$ to the mixture of $p_{g_{\theta=\theta_0}}(x)$ and $p_{data}(x)$
⇒ Drives $p_{g_\theta}(x)$ to $p_{data}(x)$



GANs: minimizing KLD 


[Figure: densities $p_{\theta=\theta_0}(x|y=1) = p_{data}(x)$ and $p_{\theta=\theta_0}(x|y=0) = p_{g_{\theta=\theta_0}}(x)$, the target $q^r(x|y=0)$, and the updated $p_{\theta=\theta^{new}}(x|y=0) = p_{g_{\theta^{new}}}(x)$, which leaves one mode of the data uncovered (missed mode)]

¢ Missing mode phenomena of GANs
¢ Asymmetry of KLD
¢ Concentrates $p_\theta(x|y=0)$ to large modes of $q^r(x|y)$
=> $p_{g_\theta}(x)$ misses modes of $p_{data}(x)$
¢ Symmetry of JSD
¢ Does not affect the behavior of mode missing

$$\mathrm{KL}\big(p_{g_\theta}(x) \,\|\, q^r(x|y=0)\big) = \int p_{g_\theta}(x) \log \frac{p_{g_\theta}(x)}{q^r(x|y=0)}\, dx$$

¢ Large positive contribution to the KLD in the regions of x space where $q^r(x|y=0)$ is small, unless $p_{g_\theta}(x)$ is also small
=> $p_{g_\theta}(x)$ tends to avoid regions where $q^r(x|y=0)$ is small
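
The asymmetry argument above can be illustrated with a small numerical experiment. This is a sketch under assumptions: the bimodal target, the two candidate distributions and the grid are all invented for illustration and are not taken from the slides. It compares a candidate that sits on one mode with a broad candidate that covers both, under each direction of the KLD.

```python
import numpy as np

x = np.linspace(-12.0, 12.0, 2401)


def normal(mean, std):
    """Discretised Gaussian on the grid x, normalised to sum to 1."""
    w = np.exp(-0.5 * ((x - mean) / std) ** 2)
    return w / w.sum()


def kl(p, q):
    """KL(p || q) for discrete distributions with full support."""
    return float(np.sum(p * (np.log(p) - np.log(q))))


target = 0.5 * normal(-3.0, 0.5) + 0.5 * normal(3.0, 0.5)  # bimodal target Q
one_mode = normal(3.0, 0.5)  # candidate sitting on a single mode
broad = normal(0.0, 3.0)     # candidate spread over both modes

# GAN-like direction, KL(P || Q): penalises mass where the target is small.
kl_p_target = {"one_mode": kl(one_mode, target), "broad": kl(broad, target)}
# VAE-like direction, KL(Q || P): penalises missing mass where the target is large.
kl_target_p = {"one_mode": kl(target, one_mode), "broad": kl(target, broad)}
```

Under KL(P || Q) the single-mode candidate scores better than the broad one, which matches the missing-mode behaviour; under KL(Q || P) the ordering flips and the broad, mode-covering candidate is preferred.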
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GANs: minimizing KLD 


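¢ Lemma 1: The updates of $\theta$ at $\theta_0$ have

$$\nabla_\theta \Big[ -\mathbb{E}_{p_\theta(x|y)p(y)} \big[ \log q^r_{\phi_0}(y|x) \big] \Big] \Big|_{\theta=\theta_0} = \nabla_\theta \Big[ \mathbb{E}_{p(y)} \big[ \mathrm{KL}\big( p_\theta(x|y) \,\|\, q^r(x|y) \big) \big] - \mathrm{JSD}\big( p_\theta(x|y=0) \,\|\, p_\theta(x|y=1) \big) \Big] \Big|_{\theta=\theta_0}$$

¢ No assumption on optimal discriminator $q_{\phi_0}(y|x)$
¢ Previous results usually rely on (near) optimal discriminator
¢ $q_{\phi_0}(y=1|x) = p_{data}(x) / (p_{data}(x) + p_g(x))$
¢ Optimality assumption is impractical: limited expressiveness of $D_\phi$ [Arora et al 2017]

¢ Our result is a generalization of the previous theorem [Arjovsky & Bottou 2017]
¢ Plugging the optimal discriminator into the above equation, we recover the theorem

$$\nabla_\theta \Big[ -\mathbb{E}_{p_\theta(x|y)p(y)} \big[ \log q^r_{\phi_0}(y|x) \big] \Big] \Big|_{\theta=\theta_0} = \nabla_\theta \Big[ \tfrac{1}{2} \mathrm{KL}\big( p_{g_\theta} \,\|\, p_{data} \big) - \mathrm{JSD}\big( p_{g_\theta} \,\|\, p_{data} \big) \Big] \Big|_{\theta=\theta_0}$$

¢ Gives insights on the generator training when the discriminator is optimal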
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GANs: minimizing KLD 


In summary: 


¢ Reveal connection to variational inference 
¢ Build connections to VAEs (slides soon) 
¢ Inspire new model variants based on the connections 


¢ Offer insights into the generator training 
¢ Formal explanation of the missing mode behavior of GANs 


¢ Still hold when the discriminator does not achieve its optimum at each 
iteration 
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Variant of GAN: InfoGAN 


[Figure: graphical models of GANs and InfoGAN, each relating the code to data/gen; InfoGAN adds $q_\eta(z|x,y)$ alongside $q^{(r)}_\phi(y|x)$]

¢ GANs don't offer the functionality of inferring code z given data
¢ InfoGAN [Chen et al., 2016]
¢ Introduce inference model $Q(z|x)$ with parameters $\eta$
¢ Augment the objectives of GANs by additionally inferring z

$$\max_D \mathcal{L}_D = \mathbb{E}_{x \sim p_{data}(x)} [\log D(x)] + \mathbb{E}_{x \sim G(z),\, z \sim p(z)} [\log(1 - D(x))],$$
$$\max_{G,Q} \mathcal{L}_{G,Q} = \mathbb{E}_{x \sim G(z),\, z \sim p(z)} [\log D(x) + \log Q(z|x)].$$


InfoGAN: new formulation

¢ Defines conditional $q_\eta(z|x,y)$
¢ $q_\eta(z|x,y=1)$ is fixed without free parameters to learn
¢ As GANs assume the code space of real data is degenerated
¢ Parameters $\eta$ are only associated with $q_\eta(z|x,y=0)$

¢ Rewrite in the new form:

$$\max_\phi \mathcal{L}_\phi = \mathbb{E}_{p_\theta(x|y)p(y)} \big[ \log q_\eta(z|x,y)\, q_\phi(y|x) \big]$$
$$\max_{\theta,\eta} \mathcal{L}_{\theta,\eta} = \mathbb{E}_{p_\theta(x|y)p(y)} \big[ \log q_\eta(z|x,y)\, q^r_\phi(y|x) \big]$$
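GANs vs InfoGAN

[Figure: graphical models over code and data/gen for GANs (left) and InfoGAN (right); InfoGAN adds $q_\eta(z|x,y)$ alongside $q^{(r)}_\phi(y|x)$]

GANs:
$$\max_\phi \mathcal{L}_\phi = \mathbb{E}_{p_\theta(x|y)p(y)} [\log q_\phi(y|x)]$$
$$\max_\theta \mathcal{L}_\theta = \mathbb{E}_{p_\theta(x|y)p(y)} [\log q^r_\phi(y|x)]$$

InfoGAN:
$$\max_\phi \mathcal{L}_\phi = \mathbb{E}_{p_\theta(x|y)p(y)} [\log q_\eta(z|x,y)\, q_\phi(y|x)]$$
$$\max_{\theta,\eta} \mathcal{L}_{\theta,\eta} = \mathbb{E}_{p_\theta(x|y)p(y)} [\log q_\eta(z|x,y)\, q^r_\phi(y|x)]$$
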


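InfoGAN: new formulation

¢ Similar results as in GANs hold:
¢ Let $q^r(x|z,y) \propto q_{\eta=\eta_0}(z|x,y)\, q^r_{\phi=\phi_0}(y|x)\, p_{\theta=\theta_0}(x)$
¢ We have:

$$\nabla_\theta \Big[ -\mathbb{E}_{p_\theta(x|y)p(y)} \big[ \log q_{\eta_0}(z|x,y)\, q^r_{\phi_0}(y|x) \big] \Big] \Big|_{\theta=\theta_0} = \nabla_\theta \Big[ \mathbb{E}_{p(y)} \big[ \mathrm{KL}\big( p_\theta(x|y) \,\|\, q^r(x|z,y) \big) \big] - \mathrm{JSD}\big( p_\theta(x|y=0) \,\|\, p_\theta(x|y=1) \big) \Big] \Big|_{\theta=\theta_0}$$

¢ Next we show correspondences between GANs/InfoGAN and VAEs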
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Relates VAEs with GANs 


¢ Resemblance of GAN generator learning to variational inference

¢ Suggests strong relations between VAEs and GANs

¢ Indeed, VAEs are basically minimizing KLD in the opposite direction, and with a degenerated adversarial discriminator

[Figure: graphical model of InfoGAN with $q_\eta(z|x,y)$ and $q^{(r)}_\phi(y|x)$, next to the model obtained by swapping the generation (solid-line) and inference (dashed-line) processes of InfoGAN, with $q_\eta(z|x,y)$ and a degenerated discriminator $q^{(r)}_*(y|x)$]
Recap: conventional formulation of VAEs

¢ Objective:
$$\max_{\theta,\eta} \mathcal{L}^{vae}_{\theta,\eta} = \mathbb{E}_{p_{data}(x)} \Big[ \mathbb{E}_{\tilde q_\eta(z|x)} [\log \tilde p_\theta(x|z)] - \mathrm{KL}\big( \tilde q_\eta(z|x) \,\|\, \tilde p(z) \big) \Big]$$
¢ $\tilde p(z)$: prior over z
¢ $\tilde p_\theta(x|z)$: generative model
¢ $\tilde q_\eta(z|x)$: inference model
¢ Only uses real examples from $p_{data}(x)$, lacks adversarial mechanism

¢ To align with GANs, let's introduce the real/fake indicator y and adversarial discriminator
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VAEs: new formulation 
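¢ Assume a perfect discriminator $q_*(y|x)$
¢ $q_*(y=1|x) = 1$ if x is a real example
¢ $q_*(y=0|x) = 1$ if x is a generated sample
¢ $q^r_*(y|x) := q_*(1-y|x)$
¢ Generative distribution

$$p_\theta(x|z,y) = \begin{cases} \tilde p_\theta(x|z) & y = 0 \\ p_{data}(x) & y = 1 \end{cases}$$

¢ Let $p_\theta(z,y|x) \propto p_\theta(x|z,y)\, p(z|y)\, p(y)$
¢ Lemma 2

$$\mathcal{L}^{vae}_{\theta,\eta} = 2 \cdot \mathbb{E}_{p_{\theta_0}(x)} \Big[ \mathbb{E}_{q_\eta(z|x,y) q^r_*(y|x)} [\log p_\theta(x|z,y)] - \mathrm{KL}\big( q_\eta(z|x,y)\, q^r_*(y|x) \,\|\, p(z|y)\, p(y) \big) \Big]$$
$$= 2 \cdot \mathbb{E}_{p_{\theta_0}(x)} \Big[ -\mathrm{KL}\big( q_\eta(z|x,y)\, q^r_*(y|x) \,\|\, p_\theta(z,y|x) \big) \Big].$$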
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Lemma 2: sketch of proof 


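¢ Lemma 2

$$\mathcal{L}^{vae}_{\theta,\eta} = 2 \cdot \mathbb{E}_{p_{\theta_0}(x)} \Big[ \mathbb{E}_{q_\eta(z|x,y) q^r_*(y|x)} [\log p_\theta(x|z,y)] - \mathrm{KL}\big( q_\eta(z|x,y)\, q^r_*(y|x) \,\|\, p(z|y)\, p(y) \big) \Big]$$
$$= 2 \cdot \mathbb{E}_{p_{\theta_0}(x)} \Big[ -\mathrm{KL}\big( q_\eta(z|x,y)\, q^r_*(y|x) \,\|\, p_\theta(z,y|x) \big) \Big].$$

¢ Proof
1) Expand $\mathbb{E}_{p_{\theta_0}(x)}[\cdot] = \tfrac{1}{2} \mathbb{E}_{p_{\theta_0}(x|y=1)}[\cdot] + \tfrac{1}{2} \mathbb{E}_{p_{\theta_0}(x|y=0)}[\cdot]$
2) $\tfrac{1}{2} \mathbb{E}_{p_{\theta_0}(x|y=0)}[\cdot]$ is constant
¢ Due to the perfect discriminator $q^r_*(y|x)$
¢ Blocks out generated samples in the training loss
3) $\tfrac{1}{2} \mathbb{E}_{p_{\theta_0}(x|y=1)}[\cdot] = \tfrac{1}{2} \mathbb{E}_{p_{data}(x)}[\cdot]$
¢ Recovers the conventional formulation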


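Proof of Lemma 2

Proof. For the reconstruction term:

$$\begin{aligned}
&\mathbb{E}_{p_{\theta_0}(x)} \Big[ \mathbb{E}_{q_\eta(z|x,y) q^r_*(y|x)} [\log p_\theta(x|z,y)] \Big] \\
&= \tfrac{1}{2} \mathbb{E}_{p_{\theta_0}(x|y=1)} \Big[ \mathbb{E}_{q_\eta(z|x,y=0),\, y=0 \sim q^r_*(y|x)} [\log p_\theta(x|z,y=0)] \Big] \\
&\quad + \tfrac{1}{2} \mathbb{E}_{p_{\theta_0}(x|y=0)} \Big[ \mathbb{E}_{q_\eta(z|x,y=1),\, y=1 \sim q^r_*(y|x)} [\log p_\theta(x|z,y=1)] \Big] \\
&= \tfrac{1}{2} \mathbb{E}_{p_{data}(x)} \Big[ \mathbb{E}_{\tilde q_\eta(z|x)} [\log \tilde p_\theta(x|z)] \Big] + \text{const},
\end{aligned} \tag{25}$$

where $y = 0 \sim q^r_*(y|x)$ means $q^r_*(y|x)$ predicts $y = 0$ with probability 1. Note that both $q_\eta(z|x,y=1)$ and $p_\theta(x|z,y=1)$ are constant distributions without free parameters to learn; $q_\eta(z|x,y=0) = \tilde q_\eta(z|x)$, and $p_\theta(x|z,y=0) = \tilde p_\theta(x|z)$.

For the KL prior regularization term:

$$\begin{aligned}
&\mathbb{E}_{p_{\theta_0}(x)} \Big[ \mathrm{KL}\big( q_\eta(z|x,y)\, q^r_*(y|x) \,\|\, p(z|y)\, p(y) \big) \Big] \\
&= \mathbb{E}_{p_{\theta_0}(x)} \Big[ \int q^r_*(y|x)\, \mathrm{KL}\big( q_\eta(z|x,y) \,\|\, p(z|y) \big)\, dy + \mathrm{KL}\big( q^r_*(y|x) \,\|\, p(y) \big) \Big] \\
&= \tfrac{1}{2} \mathbb{E}_{p_{\theta_0}(x|y=1)} \Big[ \mathrm{KL}\big( q_\eta(z|x,y=0) \,\|\, p(z|y=0) \big) + \text{const} \Big] + \tfrac{1}{2} \mathbb{E}_{p_{\theta_0}(x|y=0)} [\text{const}] \\
&= \tfrac{1}{2} \mathbb{E}_{p_{data}(x)} \Big[ \mathrm{KL}\big( \tilde q_\eta(z|x) \,\|\, \tilde p(z) \big) \Big].
\end{aligned} \tag{26}$$

Combining Eq.(25) and Eq.(26) we recover the conventional VAE objective in Eq.(7) in the paper.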
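GANs vs VAEs side by side

| | GANs (InfoGAN) | VAEs |
| Generative distribution | $p_\theta(x \mid y) = p_{g_\theta}(x)$ if $y = 0$; $p_{data}(x)$ if $y = 1$ | $p_\theta(x \mid z, y) = \tilde p_\theta(x \mid z)$ if $y = 0$; $p_{data}(x)$ if $y = 1$ |
| Discriminator distribution | $q_\phi(y \mid x)$ | $q_*(y \mid x)$, perfect, degenerated |
| z-inference model | $q_\eta(z \mid x, y)$ of InfoGAN | $q_\eta(z \mid x, y)$ |
| KLD to minimize | $\min_\theta \mathrm{KL}(p_\theta(x \mid y) \,\|\, q^r(x \mid z, y))$ ~ $\min_\theta \mathrm{KL}(P_\theta \,\|\, Q)$ | $\min_\theta \mathrm{KL}(q_\eta(z \mid x, y)\, q^r_*(y \mid x) \,\|\, p_\theta(z, y \mid x))$ ~ $\min_\theta \mathrm{KL}(Q \,\|\, P_\theta)$ |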
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Link back to wake sleep algorithm 


¢ Denote
¢ Latent variables h
¢ Parameters $\lambda$

¢ Recap: wake sleep algorithm
Wake: $\max_\theta \mathbb{E}_{q_\lambda(h|x)\, p_{data}(x)} [\log p_\theta(x|h)]$
Sleep: $\max_\lambda \mathbb{E}_{p_\theta(x|h)\, p(h)} [\log q_\lambda(h|x)]$
VAEs vs. Wake-sleep

¢ Wake sleep algorithm
Wake: $\max_\theta \mathbb{E}_{q_\lambda(h|x)\, p_{data}(x)} [\log p_\theta(x|h)]$
Sleep: $\max_\lambda \mathbb{E}_{p_\theta(x|h)\, p(h)} [\log q_\lambda(h|x)]$
¢ Let h be z, and $\lambda$ be $\eta$
=> $\max_\theta \mathbb{E}_{q_\eta(z|x)\, p_{data}(x)} [\log p_\theta(x|z)]$, recovers VAE objective of optimizing $\theta$
¢ VAEs extend wake phase by also learning the inference model ($\eta$)
$$\max_{\theta,\eta} \mathcal{L}^{vae}_{\theta,\eta} = \mathbb{E}_{q_\eta(z|x)\, p_{data}(x)} [\log p_\theta(x|z)] - \mathbb{E}_{p_{data}(x)} \big[ \mathrm{KL}( q_\eta(z|x) \,\|\, p(z) ) \big]$$
¢ Minimize the KLD in the original variational free energy wrt. $\eta$
¢ Stick to minimizing the wake-phase KLD wrt. both $\theta$ and $\eta$

¢ Do not involve sleep-phase objective
¢ Recall: sleep phase minimizes the reverse KLD in the variational free energy
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GANs vs. Wake-sleep 


¢ Wake sleep algorithm
Wake: $\max_\theta \mathbb{E}_{q_\lambda(h|x)\, p_{data}(x)} [\log p_\theta(x|h)]$
Sleep: $\max_\lambda \mathbb{E}_{p_\theta(x|h)\, p(h)} [\log q_\lambda(h|x)]$
¢ Let h be y, and $\lambda$ be $\phi$
=> $\max_\phi \mathbb{E}_{p_\theta(x|y)\, p(y)} [\log q_\phi(y|x)]$, recovers GAN objective of optimizing $\phi$

¢ GANs extend sleep phase by also learning the generative model ($\theta$)
¢ Directly extending sleep phase: $\max_\theta \mathcal{L}_\theta = \mathbb{E}_{p_\theta(x|y)\, p(y)} [\log q_\phi(y|x)]$
¢ GANs: $\max_\theta \mathcal{L}_\theta = \mathbb{E}_{p_\theta(x|y)\, p(y)} [\log q^r_\phi(y|x)]$
¢ The only difference is replacing $q_\phi$ with $q^r_\phi$
¢ This is where the adversarial mechanism comes about
¢ GANs stick to minimizing the sleep-phase KLD
¢ Do not involve wake-phase objective
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Mutual exchanges of ideas: augment the loss functions 


KLD to minimize:
GANs (InfoGAN): $\min_\theta \mathrm{KL}(p_\theta(x|y) \,\|\, q^r(x|z,y))$ ~ $\min_\theta \mathrm{KL}(P_\theta \,\|\, Q)$
VAEs: $\min_\theta \mathrm{KL}(q_\eta(z|x,y)\, q^r_*(y|x) \,\|\, p_\theta(z,y|x))$ ~ $\min_\theta \mathrm{KL}(Q \,\|\, P_\theta)$

¢ Asymmetry of KLDs inspires combination of GANs and VAEs
¢ GANs: $\min_\theta \mathrm{KL}(P_\theta \,\|\, Q)$ tends to miss modes
¢ VAEs: $\min_\theta \mathrm{KL}(Q \,\|\, P_\theta)$ tends to cover regions with small values of $p_{data}$
¢ Augment VAEs with GAN loss [Larsen et al., 2016]
¢ Alleviate the mode covering issue of VAEs
¢ Improve the sharpness of VAE generated images
¢ Augment GANs with VAE loss [Che et al., 2017]
¢ Alleviate the mode missing issue of GANs
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Mutual exchanges of ideas: augment the graphical model 


Discriminator: $q_\phi(y|x)$ (GANs) vs. $q_*(y|x)$, perfect, degenerated (VAEs)

¢ Activate the adversarial mechanism in VAEs
¢ Enable adaptive incorporation of fake samples for learning
¢ Straightforward derivation by making symbolic analog to GANs

[Figure: graphical models of vanilla VAEs (with $q_\eta(z|x,y)$, $q^{(r)}_*(y|x)$ and $p_\theta(x|z,y)$) and Adversary Activated VAEs]
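Adversary Activated VAEs (AAVAE)

¢ Vanilla VAEs:

$$\max_{\theta,\eta} \mathcal{L}^{vae}_{\theta,\eta} = \mathbb{E}_{p_{\theta_0}(x)} \Big[ \mathbb{E}_{q_\eta(z|x,y) q^r_*(y|x)} [\log p_\theta(x|z,y)] - \mathrm{KL}\big( q_\eta(z|x,y)\, q^r_*(y|x) \,\|\, p(z|y)\, p(y) \big) \Big]$$

¢ Replace $q_*(y|x)$ with a learnable one $q_\phi(y|x)$ with parameters $\phi$
¢ As usual, denote reversed distribution $q^r_\phi(y|x) = q_\phi(1-y|x)$

$$\max_{\theta,\eta} \mathcal{L}^{aavae}_{\theta,\eta} = \mathbb{E}_{p_{\theta_0}(x)} \Big[ \mathbb{E}_{q_\eta(z|x,y) q^r_\phi(y|x)} [\log p_\theta(x|z,y)] - \mathrm{KL}\big( q_\eta(z|x,y)\, q^r_\phi(y|x) \,\|\, p(z|y)\, p(y) \big) \Big]$$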
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AAVAE: adaptive data selection 


$$\max_{\theta,\eta} \mathcal{L}^{aavae}_{\theta,\eta} = \mathbb{E}_{p_{\theta_0}(x)} \Big[ \mathbb{E}_{q_\eta(z|x,y) q^r_\phi(y|x)} [\log p_\theta(x|z,y)] - \mathrm{KL}\big( q_\eta(z|x,y)\, q^r_\phi(y|x) \,\|\, p(z|y)\, p(y) \big) \Big]$$

¢ An effective data selection mechanism:
¢ Both generated samples and real examples are weighted by $q^r_\phi(y=0|x) = q_\phi(y=1|x)$
¢ Only samples that resemble real data and fool the discriminator will be used for training

¢ A real example receiving large weight $q^r_\phi(y|x)$
=> Easily recognized by the discriminator as real
=> Hard to be simulated from the generator
=> Hard examples get larger weights
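
As a rough illustration of the selection mechanism, the sketch below weights per-example losses by the discriminator's "real" probability $q_\phi(y=1|x)$. It is a simplification under assumptions: `weighted_loss` is a hypothetical helper, the numbers are invented, and dividing by the sum of the weights is a convenience chosen here, not something stated on the slide.

```python
import numpy as np


def weighted_loss(per_example_loss, q_real):
    """Average per-example losses with weights q^r(y=0|x) = q(y=1|x).

    q_real[i] is the discriminator's probability that example i is real.
    Normalising by the weight sum is an assumption made for this sketch.
    """
    w = np.asarray(q_real, dtype=float)
    loss = np.asarray(per_example_loss, dtype=float)
    return float(np.sum(w * loss) / np.sum(w))


# Two generated samples: the first fools the discriminator, the second does not.
losses = [1.0, 5.0]
q_real = [0.9, 0.1]
selected = weighted_loss(losses, q_real)
```

Here the sample that looks real dominates the average, while the easily detected fake contributes little, which is the adaptive selection effect described above.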
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AAVAE: discriminator learning 


¢ Use the binary classification objective as in GAN 


$$\max_\phi \mathcal{L}^{aavae}_\phi = \mathbb{E}_{p_\theta(x|z,y)\, p(z|y)\, p(y)} [\log q_\phi(y|x)]$$
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AAVAE: empirical results 


¢ Applied the adversary activating method on
¢ vanilla VAEs
¢ class-conditional VAEs (CVAE) 
¢ semi-supervised VAEs (SVAE) 
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AAVAE: empirical results 


¢ Evaluated test-set variational lower bound on MNIST 
¢ The higher the better 


[Figure: test-set lower bound of CVAE and AA-CVAE]

¢ X-axis: the ratio of training data for learning (0.01, 0.1, 1.)
¢ Y-axis: value of test-set lower bound
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AAVAE: empirical results 


¢ Evaluated classification accuracy of SVAE and AA-SVAE 


| | 1% | 10% |
| SVAE | 0.9412 ± .0039 | 0.9768 ± .0009 |
| AA-SVAE | 0.9425 ± .0045 | 0.9797 ± .0010 |


¢ Used 1% and 10% data labels in MNIST 
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Mutual exchanges of ideas 


e AAVAE enhances VAEs with ideas from GANs 

¢ We can also enhance GANs with ideas from VAEs 

¢ VAEs maximize a variational lower bound of log likelihood
¢ Importance weighted VAE (IWAE) [Burda et al., 2016]
¢ Maximizes a tighter lower bound through importance sampling
¢ The variational inference interpretation of GANs allows the importance weighting method to be straightforwardly applied to GANs
¢ Just copy the derivations of IWAE side by side with little adaptation!
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Importance weighted GANs (IWGAN) 


¢ Generator learning in vanilla GANs 


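$$\max_\theta \mathbb{E}_{x \sim p_\theta(x|y)p(y)} \big[ \log q^r_{\phi_0}(y|x) \big]$$

¢ Generator learning in IWGAN

$$\max_\theta \mathbb{E}_{x_1, \dots, x_k \sim p_\theta(x|y)p(y)} \Big[ \sum_{i=1}^{k} \frac{q^r_{\phi_0}(y|x_i)}{q_{\phi_0}(y|x_i)} \log q^r_{\phi_0}(y|x_i) \Big]$$

¢ Assigns higher weights to samples that are more realistic and fool the discriminator better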
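
The weighting can be sketched for a batch of generated samples (so $y = 0$). Everything named here is an assumption for illustration: `iw_generator_objective` is a hypothetical helper, the discriminator outputs are invented, and self-normalising the weights follows the usual IWAE practice rather than anything shown on the slide. For generated samples the weight $q^r(y=0|x_i) / q(y=0|x_i)$ equals $q(y=1|x_i) / (1 - q(y=1|x_i))$.

```python
import numpy as np


def iw_generator_objective(q_real):
    """Importance-weighted generator objective for k generated samples.

    q_real[i] = q_phi0(y=1 | x_i), the discriminator's 'real' probability,
    which equals q^r_phi0(y=0 | x_i) for a generated sample.
    """
    q_real = np.asarray(q_real, dtype=float)
    w = q_real / (1.0 - q_real)  # unnormalised weight q^r / q
    w_tilde = w / w.sum()        # self-normalised (assumed, as in IWAE)
    objective = float(np.sum(w_tilde * np.log(q_real)))
    return w_tilde, objective


w_tilde, objective = iw_generator_objective([0.2, 0.5, 0.8])
```

Samples that the discriminator finds more realistic (larger `q_real`) receive larger normalised weights, so they contribute more to the generator update.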
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IWGAN: empirical results 


¢ Applied the importance weighting method to
¢ vanilla GANs
¢ class-conditional GANs (CGAN)

¢ CGAN adds one dimension to code z to represent the class label
¢ The derivations of the IW extension remain the same as in vanilla GANs
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IWGAN: empirical results 


¢ Evaluated on MNIST and SVHN 


¢ Used pretrained NN to evaluate:
¢ Inception scores of samples from GANs and IW-GAN
¢ Confidence of a pre-trained classifier on generated samples + diversity of generated samples

| | MNIST | SVHN |
| GAN | 8.34 ± .03 | 5.18 ± .03 |
| IWGAN | 8.45 ± .04 | 5.34 ± .03 |

¢ Classification accuracy of samples from CGAN and IW-CGAN

| | MNIST | SVHN |
| CGAN | 0.985 ± .002 | 0.797 ± .005 |
| IWCGAN | 0.987 ± .002 | 0.793 ± .006 |
Recap: Variational Inference

Maximize the variational lower bound $\mathcal{L}(\theta, \phi; x)$, or equivalently, minimize free energy

$$F(\theta, \phi; x) = -\log p(x) + \mathrm{KL}\big( q_\phi(z|x) \,\|\, p_\theta(z|x) \big)$$

¢ E-step: maximize $\mathcal{L}$ wrt. $\phi$ with $\theta$ fixed
$$\max_\phi \mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z|x)} [\log p_\theta(x|z)] - \mathrm{KL}\big( q_\phi(z|x) \,\|\, p(z) \big)$$
¢ If with closed form solutions
$$q^*_\phi(z|x) \propto \exp[\log p_\theta(x, z)]$$
¢ M-step: maximize $\mathcal{L}$ wrt. $\theta$ with $\phi$ fixed
$$\max_\theta \mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z|x)} [\log p_\theta(x|z)] - \mathrm{KL}\big( q_\phi(z|x) \,\|\, p(z) \big)$$
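
The equivalence between the lower bound and the free energy can be verified on a tiny discrete model. The sketch below is illustrative only: the three-state latent variable and all probabilities are made-up numbers. It checks that the lower bound equals $-F$, and that the closed-form E-step solution (the exact posterior) makes the bound tight.

```python
import numpy as np

# Toy discrete model: z in {0, 1, 2} and one observed x (illustrative numbers).
prior = np.array([0.5, 0.3, 0.2])     # p(z)
lik = np.array([0.10, 0.60, 0.30])    # p(x | z) for the observed x
q = np.array([0.2, 0.5, 0.3])         # some variational q(z | x)


def kl(a, b):
    """KL(a || b) for discrete distributions with full support."""
    return float(np.sum(a * (np.log(a) - np.log(b))))


evidence = float(np.sum(prior * lik))  # p(x)
posterior = prior * lik / evidence     # p(z | x)

# Lower bound: E_q[log p(x|z)] - KL(q(z|x) || p(z)).
elbo = float(np.sum(q * np.log(lik))) - kl(q, prior)
# Free energy: -log p(x) + KL(q(z|x) || p(z|x)).
free_energy = -np.log(evidence) + kl(q, posterior)

# Closed-form E-step: q*(z|x) proportional to exp[log p(x, z)].
q_star = np.exp(np.log(prior * lik))
q_star /= q_star.sum()
elbo_star = float(np.sum(q_star * np.log(lik))) - kl(q_star, prior)
```

For any `q` the two quantities satisfy `elbo == -free_energy` up to rounding, and at `q_star` the bound reaches `log p(x)`.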
Discussion: Modeling latent vs. visible variables
¢ Latent and visible variables are traditionally distinguished clearly and modeled in very different ways

¢ A key thought in the new formulation:
¢ It is not necessary to draw a clear boundary between latent and visible variables,
¢ nor between inference and generation
¢ Instead treat them as a symmetric pair
Symmetric modeling of latent & visible variables

¢ Help with modeling and understanding:
¢ Treating the generation space x in GANs as latent
¢ reveals the connection between GANs and ADA
¢ provides a variational inference interpretation of generation

[Figure: ADA (data, feature) performs inference on features; GANs (code, data/gen) treat generation of x as performing inference]


Symmetric modeling of latent & visible variables

¢ Help with modeling and understanding:
¢ Treating the generation space x in GANs as latent
¢ reveals the connection between GANs and ADA
¢ provides a variational inference interpretation of generation

¢ Wake sleep algorithm
¢ wake phase reconstructs visible variables based on latents
¢ sleep phase reconstructs latent variables based on visibles
¢ latent and visible variables are treated in a completely symmetric way
Wake: $\max_\theta \mathbb{E}_{q_\phi(z|x)} [\log p_\theta(x|z)]$
Sleep: $\max_\phi \mathbb{E}_{p_\theta(z,x)} [\log q_\phi(z|x)]$


Symmetric modeling of latent & visible variables

¢ New modeling approaches narrow the gap

Empirical distributions over visible variables
¢ Impossible to be an explicit distribution
¢ The only information we have is the observed data examples
¢ Do not know the true parametric form of data distribution
¢ Naturally an implicit distribution
¢ Easy to sample from, hard to evaluate likelihood

Prior distributions over latent variables
¢ Traditionally defined as explicit distributions, e.g., Gaussian prior distribution
¢ Amiable for likelihood evaluation
¢ We can assume the parametric form according to our prior knowledge
¢ New tools to allow implicit priors and models
¢ GANs, density ratio estimation, approximate Bayesian computations
¢ E.g., adversarial autoencoder [Makhzani et al., 2015] replaces the Gaussian prior of vanilla VAEs with implicit priors
Symmetric modeling of latent & visible variables
¢ No difference in terms of formulations
¢ with implicit distributions and black-box NN models
¢ just swap the symbols x and z

Generation model: $z \sim p_{prior}(z)$, $x \sim f_{black\text{-}box}(z)$
Inference model: $x \sim p_{data}(x)$, $z \sim f'_{black\text{-}box}(x)$

[Figure: the generation model maps the prior distr. over z to x; the inference model maps the data distr. over x to z]
Symmetric modeling of latent & visible variables

¢ No difference in terms of formulations
¢ with implicit distributions and black-box NN models

¢ Difference in terms of space complexity
¢ depend on the problem at hand
¢ choose appropriate tools:
¢ implicit/explicit distribution, adversarial/maximum-likelihood optimization, ...

[Figure: generation models (prior distr. to data distr.) and inference models (data distr. to prior distr.), each paired with either an adversarial loss or a maximum likelihood loss]

Part-II: Conclusions

Z Hu, Z Yang, R Salakhutdinov, E Xing, "On Unifying Deep Generative Models", arXiv 1706.00550

¢ Deep generative models research has a long history
¢ Deep belief nets / Helmholtz machines / Predictability Minimization / ...

¢ Unification of deep generative models
¢ GANs and VAEs are essentially minimizing KLD in opposite directions
¢ They extend the two phases of the classic wake sleep algorithm, respectively

¢ A general formulation framework useful for
¢ Analyzing a broad class of existing DGMs and variants: ADA/InfoGAN/Joint-models/...
¢ Inspiring new models and algorithms by borrowing ideas across research fields

¢ Symmetric view of latent/visible variables
¢ No difference in formulation with implicit prior distributions and black-box NN transformations
¢ Difference in space complexity: choose appropriate tools
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Plan 


¢ Statistical And Algorithmic Foundation and Insight of Deep Learning


¢ On Unified Framework of Deep Generative Models 


¢ Computational Mechanisms: Distributed Deep Learning 
Architectures 
Part-III (1)

Inference and Learning
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Outline 


¢ Deep Learning as Dataflow Graphs 
¢ Auto-differentiable Libraries 
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A Computational Layer in DL 


¢ A layer in a neural network is composed of a few finer 
computational operations 
¢ A layer l has input x and output z, and transforms x into z following: $y = Wx + b$, $z = \mathrm{ReLU}(y)$
¢ Denote the transformation of layer l as $f_l$, which can be represented as a dataflow graph: the input x flows through the layer

[Figure: x -> $f_l$ -> z]
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From Layers to Networks 


¢ A neural network is thus a few stacked layers $l = 1, \dots, L$, where every layer represents a function transform $f_l$
¢ The forward computation proceeds by sequentially executing $f_1, f_2, f_3, \dots, f_L$

[Figure: $f_1$ -> $f_2$ -> ... -> $f_L$]

¢ Training the neural network involves deriving the gradient of its parameters with a backward pass (next slides)
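
A minimal NumPy sketch of this forward dataflow, assuming fully-connected ReLU layers as in the layer definition. The helper names (`layer_forward`, `network_forward`), the layer widths and the random initialisation are illustrative choices, not part of the slides.

```python
import numpy as np


def relu(y):
    return np.maximum(y, 0.0)


def layer_forward(x, W, b):
    """f_l: y = W x + b, z = ReLU(y)."""
    return relu(W @ x + b)


def network_forward(x, params):
    """Execute f_1, f_2, ..., f_L in order; params is a list of (W, b) pairs."""
    for W, b in params:
        x = layer_forward(x, W, b)
    return x


rng = np.random.default_rng(0)
sizes = [4, 8, 3]  # hypothetical layer widths: input 4, hidden 8, output 3
params = [
    (rng.standard_normal((n_out, n_in)), np.zeros(n_out))
    for n_in, n_out in zip(sizes[:-1], sizes[1:])
]
out = network_forward(np.ones(4), params)
```

The output of each layer becomes the input of the next, which is exactly the chain of edges in the forward dataflow graph.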
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A Computational Layer in DL 


¢ Denote the backward pass through a layer l as $b_l$
¢ $b_l$ derives the gradient of the input x (dx), given the gradient of z as dz, as well as the gradients of the parameters W, b
¢ dx will be the backward input of its previous layer $l - 1$

¢ The backward pass can be thought of as a backward dataflow where the gradient flows through the layer

[Figure: dx <- $b_l$ <- dz]
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Backpropagation through a NN 


¢ The backward computation proceeds by sequentially executing $b_L, b_{L-1}, b_{L-2}, \dots, b_1$

[Figure: $b_L$ -> ... -> $b_1$]
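
The backward dataflow can be sketched in NumPy for the same kind of ReLU layers. This is a hand-written illustration under assumptions: the helper names, the toy squared-norm loss and the random parameters are all invented here, and real toolkits derive these steps automatically.

```python
import numpy as np


def layer_forward(x, W, b):
    """f_l: returns z and a cache of what the backward pass needs."""
    y = W @ x + b
    z = np.maximum(y, 0.0)
    return z, (x, W, y)


def layer_backward(dz, cache):
    """b_l: given dz, return dx plus the parameter gradients dW and db."""
    x, W, y = cache
    dy = dz * (y > 0)  # ReLU passes gradient only where y > 0
    return W.T @ dy, np.outer(dy, x), dy


def forward_backward(x, params):
    caches = []
    for W, b in params:              # f_1, ..., f_L
        x, cache = layer_forward(x, W, b)
        caches.append(cache)
    loss = 0.5 * np.sum(x ** 2)      # toy loss, chosen only for illustration
    grad = x                         # d loss / d z_L
    grads = []
    for cache in reversed(caches):   # b_L, ..., b_1
        grad, dW, db = layer_backward(grad, cache)
        grads.append((dW, db))
    return loss, grads[::-1]         # gradients ordered from layer 1 to L


rng = np.random.default_rng(0)
params = [
    (rng.standard_normal((5, 4)), rng.standard_normal(5)),
    (rng.standard_normal((3, 5)), rng.standard_normal(3)),
]
x_in = rng.standard_normal(4)
loss, grads = forward_backward(x_in, params)
```

Each `layer_backward` call consumes the `dz` produced by the layer above it and emits the `dx` that the layer below will consume, mirroring the order $b_L, \dots, b_1$.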
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A Layer as a Dataflow Graph 


¢ Given the forward computation flow, gradients can be computed by auto differentiation

¢ Automatically derive the backward gradient flow graph from the forward dataflow graph
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A Network as a Dataflow Graph 


¢ Gradients can be computed by auto differentiation
¢ Automatically derive the gradient flow graph from the forward dataflow graph

[Figure: forward graph $f_1$ -> ... -> $f_L$ and the derived backward graph $b_L$ -> ... -> $b_1$]
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Gradient Descent via Backpropagation 


¢ The computational workflow of deep learning 
¢ Forward, which we usually also call inference: forward dataflow 
¢ Backward, which derives the gradients: backward gradient flow 
¢ Apply/update gradients and repeat 


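¢ Mathematically,

$$\theta^{(t)} = \theta^{(t-1)} + \varepsilon \cdot \nabla_{\mathcal{L}}(\theta^{(t-1)}, D^{(t)})$$

where $\theta$ are the model parameters, $D^{(t)}$ is the data, evaluating $\mathcal{L}(\theta^{(t-1)}, D^{(t)})$ is the forward pass, and computing $\nabla_{\mathcal{L}}$ is the backward pass.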
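
The forward / backward / apply loop can be sketched with a linear model trained by mini-batch SGD. The model, data, step size and batch size are all assumptions made for this illustration; the update direction here is the negative gradient of the loss, which plays the role of the $\varepsilon \cdot \nabla_{\mathcal{L}}$ term in the formula.

```python
import numpy as np


def loss_and_grad(theta, X, y):
    """Forward: mean squared error of a linear model; backward: its gradient."""
    residual = X @ theta - y
    return float(np.mean(residual ** 2)), 2.0 * X.T @ residual / len(y)


rng = np.random.default_rng(0)
true_theta = np.array([2.0, -1.0, 0.5])  # hypothetical ground truth
X = rng.standard_normal((256, 3))
y = X @ true_theta

theta = np.zeros(3)  # model parameters
step = 0.1           # learning rate (epsilon)
for t in range(200):
    batch = rng.choice(len(y), size=32, replace=False)   # D^(t)
    _, grad = loss_and_grad(theta, X[batch], y[batch])   # forward + backward
    theta = theta - step * grad                          # apply the update

final_loss, _ = loss_and_grad(theta, X, y)
```

Each iteration touches only the current mini-batch, so the same loop structure carries over when the forward and backward passes are supplied by an auto-differentiation library.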
Program a neural network

¢ Define a neural network
¢ Define operations and layers: fully-connected? Convolution? Recurrent?
¢ Define the data I/O: read what data from where?
¢ Define a loss function/optimization objective: L2 loss? Softmax? Ranking loss?
¢ Define an optimization algorithm: SGD? Momentum SGD? etc.

¢ Auto-differential libraries will then take over
¢ Connect operations, data I/O, loss functions and trainer.
¢ Build the forward dataflow graph and backward gradient flow graphs.
¢ Perform training and apply updates
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Outline 


¢ Deep Learning as Dataflow Graphs 
¢ Auto-differentiable Libraries 
Auto-differential Libraries

¢ An auto-differential library automatically derives the gradients following the back-propagation rule.

¢ A lot of auto-differentiation libraries have been developed:
¢ So-called Deep Learning toolkits

[Logos: Microsoft CNTK, Torch, PyTorch, DyNet, Theano, Caffe & Caffe2, Chainer, DMLC MXNet]
Deep Learning Toolkits

¢ They are adopted differently in different domains
¢ For example

[Figure: toolkit logos (Microsoft CNTK, Caffe & Caffe2, Torch, PyTorch, DyNet, Theano, Chainer, DMLC MXNet) arranged along an axis from Vision to NLP]
Deep Learning Toolkits

¢ They are also designed differently
¢ Symbolic vs. imperative programming

[Figure: toolkits (Caffe, TensorFlow, DyNet, Caffe2, Torch, Theano, Chainer, DMLC MXNet, PyTorch) arranged along an axis from Imperative to Symbolic]
Deep Learning Toolkits

¢ Symbolic vs. imperative programming
¢ Symbolic: write symbols to assemble the networks first, evaluate later
¢ Imperative: immediate evaluation

Symbolic:

    A = Variable('A')
    B = Variable('B')
    C = B * A
    D = C + Constant(1)
    # compiles the function
    f = compile(D)
    d = f(A=np.ones(10), B=np.ones(10)*2)

Imperative:

    import numpy as np
    a = np.ones(10)
    b = np.ones(10) * 2
    c = b * a
    d = c + 1
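
To make the contrast concrete, the sketch below implements a tiny deferred-evaluation graph so that the symbolic snippet can actually run, next to the imperative version. The `Node` class, `Variable`, `Constant` and `compile_graph` are hypothetical stand-ins written for this illustration; they are not the API of any real toolkit.

```python
import numpy as np


class Node:
    """A symbolic expression: nothing is computed until evaluate() is called."""

    def __init__(self, op, inputs=(), name=None, value=None):
        self.op, self.inputs, self.name, self.value = op, inputs, name, value

    def __mul__(self, other):
        return Node("mul", (self, other))

    def __add__(self, other):
        return Node("add", (self, other))

    def evaluate(self, feed):
        if self.op == "var":
            return feed[self.name]
        if self.op == "const":
            return self.value
        left, right = (n.evaluate(feed) for n in self.inputs)
        return left * right if self.op == "mul" else left + right


def Variable(name):
    return Node("var", name=name)


def Constant(value):
    return Node("const", value=value)


def compile_graph(node):
    """Return a callable that evaluates the graph on the given inputs."""
    return lambda **feed: node.evaluate(feed)


# Symbolic: build the graph first, evaluate later.
A, B = Variable("A"), Variable("B")
D = B * A + Constant(1)
f = compile_graph(D)
d_symbolic = f(A=np.ones(10), B=np.ones(10) * 2)

# Imperative: every line is evaluated immediately.
a = np.ones(10)
b = np.ones(10) * 2
d_imperative = b * a + 1
```

Because the symbolic version holds the whole expression before running it, a toolkit has a chance to optimize or distribute the graph; the imperative version gives up that global view in exchange for line-by-line execution.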
Deep Learning Toolkits


¢ Symbolic 

¢ Good 
¢ easy to optimize (e.g. distributed, batching, parallelization) for developers 
¢ More efficient 

¢ Bad 
¢ The way of programming might be counter-intuitive 
¢ Hard to debug for user programs 
¢ Less flexible: you need to write symbols before actually doing anything 


¢ Imperative: 
¢ Good 
¢ More flexible: write one line, evaluate one line 
¢ Easy to program and easy to debug: because it matches the way we use C++ or python 
¢ Bad 
¢ Less efficient 
¢ More difficult to optimize 
Deep Learning Toolkits

¢ They are also designed differently
¢ For another example, dataflow graphs vs. layer-by-layer construction

[Figure: toolkits (Caffe, TensorFlow, Caffe2, Torch, DyNet, Theano, PyTorch, Chainer, DMLC MXNet) arranged along an axis from layer-by-layer construction to dataflow graphs]


Good and Bad of Dataflow Graphs 


¢ Dataflow graphs seem to be a dominant choice for representing deep learning models

¢ What's good for dataflow graphs
¢ Good for static workflows: define once, run for arbitrary batches/data
¢ Programming convenience: easy to program once you get used to it.
¢ Easy to parallelize/batch for a fixed graph
¢ Easy to optimize: a lot of off-the-shelf optimization techniques for graphs

¢ What's bad for dataflow graphs
¢ Not good for dynamic workflows: need to define a graph for every training sample -> overheads
¢ Hard to program dynamic neural networks: how can you define dynamic graphs using a language for static graphs? (e.g. LSTM, tree-LSTM).
¢ Not easy for debugging.
¢ Difficult to parallelize/batch across multiple graphs: every graph is different, no natural batching.
Static vs. Dynamic Dataflow Graphs

¢ Static Dataflow graphs
¢ Define once, execute many times
¢ For example: convolutional neural networks
¢ Execution: once defined, all following computation will follow the defined computation
¢ Advantages
¢ No extra effort for batching optimization, because it can be batched by nature
¢ It is always easy to handle a static computational dataflow graph in all aspects, because of its fixed structure
¢ Node placement, distributed runtime, memory management, etc.
¢ Benefits the developers
Static vs. Dynamic Dataflow Graphs

¢ Dynamic Dataflow graphs
¢ When do we need them?
¢ In all cases where static dataflow graphs do not work well
¢ Variably sized inputs
¢ Variably structured inputs
¢ Nontrivial inference algorithms
¢ Variably structured outputs
¢ Etc.
Static vs. Dynamic Dataflow Graphs

¢ Can we handle dynamic dataflow graphs? Using static methods (or declaration) will have a lot of problems
¢ Difficulty in expressing complex flow-control logic
¢ Complexity of the computation graph implementation
¢ Difficulty in debugging
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Introducing DyNet 


¢ Designed for dynamic deep learning workflow, e.g.
¢ Tree-LSTM for neural machine translation, where each sentence defines a structure that corresponds to the computational flow
¢ Graph-LSTM for image parsing, where each image has a specific connection between segments
¢ etc.

[Figure: variably structured inputs at the level of words, phrases, sentences and documents; the examples include a parsed sentence beginning "Alice gave a message to" and the review "This film was completely unbelievable. The characters were wooden and the plot was absurd. That being said, I liked it."]
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Key Ingredients in DyNet 


¢ Concept 
¢ Separate parameter declaration and graph construction 
¢ Declare trainable parameters and construct models first 
¢ Parameters, e.g. the weight matrices in an LSTM unit. 
¢ Construct a model as a collection of trainable parameters 
¢ Construct computation graphs 
¢ Allocate a few nodes for our computation (node can be seen as layers in NN) 
¢ Specify the dataflow graph by connecting nodes together 
¢ If necessary, different graphs for different input samples
¢ Conclusion: Define parameters once, but define graphs dynamically depending on inputs

    import dynet as dy

    model = dy.Model()
    pW = model.add_parameters((20, 4))
    pb = model.add_parameters(20)

    dy.renew_cg()
    x = dy.inputVector([1, 2, 3, 4])
    W = dy.parameter(pW)  # convert the parameter into an expression
    b = dy.parameter(pb)  # and add it to the graph
    y = W * x + b
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Key Ingredients in DyNet 


¢ Backend and programming model
¢ Graph construction
¢ In TensorFlow, constructing a graph has a considerable overhead.
¢ TensorFlow users avoid defining graphs repeatedly
¢ DyNet: highly optimized graph definition
¢ Little overhead defining a graph: good for dynamic neural networks.
¢ Easy to write recursive programs to define graphs (very effective for many dynamic networks, such as tree-LSTM or graph-LSTM).
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Key Ingredients in DyNet 


¢ A visual comparison

DyNet TreeLSTM (30 LoC):

    import dynet as dy

    class TreeRNNBuilder(object):
        def __init__(self, model, word_vocab, hdim):
            self.W = model.add_parameters((hdim, 2 * hdim))
            self.E = model.add_lookup_parameters((len(word_vocab), hdim))
            self.w2i = word_vocab

        def encode(self, tree):
            if tree.isleaf():
                return self.E[self.w2i.get(tree.label, 0)]
            elif len(tree.children) == 1:  # unary node, skip
                expr = self.encode(tree.children[0])
                return expr
            else:
                assert len(tree.children) == 2
                e1 = self.encode(tree.children[0])
                e2 = self.encode(tree.children[1])
                W = dy.parameter(self.W)
                expr = dy.tanh(W * dy.concatenate([e1, e2]))
                return expr

    # word_vocabulary and read_examples() are defined elsewhere
    model = dy.Model()
    U_p = model.add_parameters((2, 50))
    tree_builder = TreeRNNBuilder(model, word_vocabulary, 50)
    trainer = dy.AdamTrainer(model)
    for epoch in range(10):
        for in_tree, out_label in read_examples():
            dy.renew_cg()
            U = dy.parameter(U_p)
            loss = dy.pickneglogsoftmax(U * tree_builder.encode(in_tree), out_label)
            loss.forward()
            loss.backward()
            trainer.update()

[Figure: the equivalent TensorFlow TreeLSTM implementation (200 LoC)]
Distributed Deep Learning


Outline 


¢ Overview: Distributed Deep Learning on GPUs 
¢ Challenges 1: Addressing the communication bottleneck 
¢ Challenges 2: Handling the limited GPU memory 
Review: DL toolkits on a single machine

¢ Using GPU is a must
¢ A small number of GPU-equipped machines could achieve satisfactory speedup compared to CPU clusters with thousands of cores
¢ A cluster of 8 GPU-equipped machines (more readily available)
¢ A cluster of 2000 CPU cores
Review: DL toolkits on a single machine

¢ However, using a single GPU is far from sufficient
¢ Average-sized deep networks can take days to train on a single GPU when faced with 100s of GBs to TBs of data
¢ Demand faster training of neural networks on ever-larger datasets

[Figure: network architectures with single-GPU training times: AlexNet, 5 to 7 days; GoogLeNet, 10+ days]

¢ However, current distributed DL implementations (e.g. in TensorFlow) can scale poorly due to substantial parameter synchronization over the network (we will show later)
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Outline 


¢ Overview: Distributed Deep Learning on GPUs 
¢ Challenge 1: Addressing the communication bottleneck
¢ Challenge 2: Handling the limited GPU memory
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Challenges


¢ Communication challenges
¢ GPUs are at least one order of magnitude faster than CPUs


¢ The high communication load makes network communication the main bottleneck,
given the limited bandwidth of commodity Ethernet


¢ Managing computation and communication in a distributed GPU cluster often
complicates the algorithm design
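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Let's see what causes the problem


¢ Deep learning on a single node: an iterative-convergent
formulation


$$\theta^{(t)} = \theta^{(t-1)} + \varepsilon \cdot \nabla_{\mathcal{L}}\left(\theta^{(t-1)}, D^{(t)}\right)$$


¢ $\theta$: model parameters
¢ $D^{(t)}$: data
¢ $\nabla_{\mathcal{L}}$: computed by a forward and a backward pass
¢ The update ($+$): apply gradients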
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Let's see what causes the problem


¢ Forward and backward are the main computational workload (99%) of deep
learning programs.
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Distributed Deep Learning


¢ Distributed DL: parallelize DL training using multiple machines.


¢ I.e., we want to spread the heaviest workload (the forward and backward
computation of $\nabla_{\mathcal{L}}$) over multiple machines


$$\theta^{(t)} = \theta^{(t-1)} + \varepsilon \cdot \nabla_{\mathcal{L}}\left(\theta^{(t-1)}, D^{(t)}\right)$$


Forward and backward are the main computational workload (99%) of deep
learning programs.
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Data parallelism with stochastic gradient descent


¢ We usually seek a parallelization strategy called data parallelism, based
on SGD


¢ We partition the data into different parts
¢ Different machines compute the gradient updates on different data partitions
¢ The updates are then aggregated/synchronized


[Figure: the data is split into partitions, one per worker]
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Data Parallel SGD


¢ Data parallel stochastic gradient descent


¢ Data parallelism requires every worker to have read and write
access to the shared model parameters $\theta$, which causes
communication among workers


$$\theta^{(t+1)} = \theta^{(t)} + \varepsilon \sum_{p=1}^{P} \nabla_{\mathcal{L}}\left(\theta^{(t)}, D_p^{(t)}\right)$$


¢ $D_p^{(t)}$: data partition $p$
¢ Each term $\nabla_{\mathcal{L}}(\theta^{(t)}, D_p^{(t)})$ is computed locally on each worker
¢ The sum is collected and aggregated before application, which is where
communication is required
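The data-parallel update can be sketched on a toy problem. The following is a minimal NumPy illustration, not the tutorial's code: the least-squares loss, the sizes, and the names `grad`, `parts` and `eps` are all assumptions chosen for the example. It shows that summing the per-partition gradients reproduces the gradient over the whole batch, which is why only the aggregation step needs communication.

```python
import numpy as np

def grad(theta, X, y):
    """Gradient of the summed squared error 0.5 * ||X @ theta - y||^2."""
    return X.T @ (X @ theta - y)

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 5))
y = rng.normal(size=64)
theta = np.zeros(5)
eps = -0.01  # assumption: the descent direction is folded into the sign of eps

P = 4  # number of simulated workers
parts = np.array_split(np.arange(64), P)

# Local step: every worker computes a gradient on its own data partition.
local_grads = [grad(theta, X[idx], y[idx]) for idx in parts]

# Aggregation: the only step that would need communication on a real cluster.
aggregated = np.sum(local_grads, axis=0)
theta_next = theta + eps * aggregated

# Reference: the gradient computed on the full batch by a single node.
full = grad(theta, X, y)
```

Because the loss here is a plain sum over samples, `aggregated` and `full` agree up to floating-point rounding.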
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How to communicate


¢ Parameter server, e.g. Bosen, SSP
¢ A parameter server (PS) is a shared memory system that provides
shared access to the global model parameters $\theta$
¢ Deep learning can be trivially data-parallelized over distributed
workers using a PS in 3 steps:


¢ Each worker computes the gradients ($\nabla_{\mathcal{L}}$) on its own data partition
($D_p$) and sends them to the remote servers;


¢ The servers receive the updates and apply ($+$) them to the globally shared
parameters;


¢ Each worker pulls back the updated parameters ($\theta_t$)
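The three steps can be mimicked with an in-process toy. This sketch is only an illustration under stated assumptions: `ToyParameterServer`, `worker_step`, the least-squares loss and the step size are hypothetical, there is no networking, and the simulated workers take turns rather than running concurrently. It is not the API of Bosen or any other real parameter server.

```python
import numpy as np

class ToyParameterServer:
    """In-process stand-in for a parameter server (hypothetical, no networking)."""

    def __init__(self, dim):
        self.theta = np.zeros(dim)

    def push(self, update):
        # Step 2: the server applies an incoming update to the shared parameters.
        self.theta += update

    def pull(self):
        # Step 3: a worker fetches a copy of the current parameters.
        return self.theta.copy()

def worker_step(server, X, y, step):
    theta = server.pull()
    grad = X.T @ (X @ theta - y)  # Step 1: gradient on the local partition
    server.push(-step * grad)

rng = np.random.default_rng(0)
true_theta = np.array([1.0, -2.0, 0.5])
partitions = []
for _ in range(4):
    X = rng.normal(size=(40, 3))
    partitions.append((X, X @ true_theta))  # noise-free targets

server = ToyParameterServer(dim=3)
for _ in range(300):
    for X, y in partitions:
        worker_step(server, X, y, step=0.005)
```

After the loop, `server.pull()` is close to `true_theta`, since every partition is consistent with the same underlying parameters.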
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How PS works
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Parameter Server


¢ Parameter servers have been successful for CPU-based deep
learning
¢ Google DistBelief, Dean et al. 2012
¢ Scaled up to thousands of CPU machines and 16000 CPU cores
¢ SSPTable, Ho et al. 2013
¢ Stale-synchronous parallel consistency model
¢ Microsoft Adam, Chilimbi et al. 2014
¢ 63 machines, state-of-the-art results on ImageNet 22K
¢ Bosen, Wei et al. 2015
¢ Managed communication
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Parameter Server on GPUs


¢ Directly applying a parameter server to GPU-based distributed deep
learning will underperform (as we will show later).
¢ The GPU is too fast
¢ Ethernet bandwidth is limited, and has latency


¢ For example


¢ AlexNet: 61.5M float parameters, 0.25 s/iteration on a GeForce Titan X
(batch size = 256)
¢ Gradient generation rate: 240M floats/(s*GPU)
¢ Parallelize it over 8 machines, each with one GPU, using a PS.
¢ To ensure the computation is not blocked on the GPU (i.e. linear speed-up with
additional nodes):
¢ As a worker: send 240M floats/s and pull back 240M floats/s (at least)
¢ As a server: receive 240M * 8 floats/s and send back 240M * 8 floats/s (at least)
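The arithmetic behind these numbers can be checked in a few lines. This is a back-of-the-envelope sketch under two assumptions that the slide does not state explicitly: parameters are 4-byte floats, and a single server handles all 8 workers. The unrounded rate comes out as 246M floats/s, which the slide rounds to 240M.

```python
params = 61.5e6          # AlexNet parameters (from the slide)
sec_per_iter = 0.25      # seconds per iteration on one GPU (from the slide)
bytes_per_float = 4      # assumption: 32-bit floats
workers = 8

# Gradients produced per second by one GPU, in floats and in Gbit/s.
floats_per_sec = params / sec_per_iter
worker_gbit = floats_per_sec * bytes_per_float * 8 / 1e9

# A single server would have to absorb the traffic of all workers.
server_gbit = worker_gbit * workers

link_rates_gbit = {"1 GbE": 1, "10 GbE": 10, "40 Gbit/s link": 40}
for name, rate in link_rates_gbit.items():
    print(f"{name}: worker needs {worker_gbit / rate:.2f}x the link, "
          f"server needs {server_gbit / rate:.2f}x the link (one direction)")
```

Under these assumptions a worker already needs most of a 10 Gbit/s link in each direction, and a single server needs more than a 40 Gbit/s link.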
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Parameter Server on GPUs


¢ Let's see where we are


Ethernet standards

| Ethernet | Rate (Gbit/s) | Rate (MB/s) | Rate (# floats/s) | Note |
| 1 GbE | 1 | 125 | 31.25M | This is what the GPU workstation in your lab has |
| 10 GbE | 10 | 1250 | 312.5M | One of the most expensive instances AWS could provide you ($18/h?) |
| InfiniBand | 40 | 5000 | 1250M | Specialized hardware! Not commodity anymore, unaffordable |
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Parameter Server on GPUs


The problem is more severe than described above
¢ We only use 8 nodes (which is small). How about 32, 128, or even 256?
¢ We haven't considered other issues (which might also be
troublesome), e.g.
¢ Memory copy between DRAM and GPU has a non-trivial cost


¢ The Ethernet might be shared with other tasks, i.e. the available bandwidth is even
less.


¢ Burst communication happens very often on GPUs (which we will explain later).
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Address the Communication Bottleneck 


¢ A simple fact:
¢ Communication time may be reduced, but cannot be eliminated (of
course)
¢ Therefore, possible ideas to address the communication
bottleneck:
¢ Hide the communication time by overlapping it with the computation
time
¢ Reduce the size of the messages that need to be communicated
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Overlap Computation and Communication


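¢ Revisit the computation flow of BP on a single node
¢ $b_l$: backpropagation computation through layer $l$
¢ $C_t$: forward and backward computation at iteration $t$


[Figure: timeline of the backward steps $b_L, \ldots, b_1$ within $C_t$]


$$\theta^{(t)} = \theta^{(t-1)} + \varepsilon \cdot \nabla_{\mathcal{L}}\left(\theta^{(t-1)}, D^{(t)}\right)$$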
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Overlap Computation and Communication


¢ On multiple nodes, when communication is involved


¢ Introduce two communication operations
¢ $o_l$: send out the gradients of layer $l$ to the remote
¢ $i_l$: pull back the globally shared parameters of layer $l$ from the remote
¢ $O_t$: the set $\{o_l\}_{l=1}^{L}$ at iteration $t$
¢ $I_t$: the set $\{i_l\}_{l=1}^{L}$ at iteration $t$


$$\theta^{(t+1)} = \theta^{(t)} + \varepsilon \sum_{p=1}^{P} \nabla_{\mathcal{L}}\left(\theta^{(t)}, D_p^{(t)}\right)$$


[Figure: timeline $C_t \mid O_t \mid I_t \mid C_{t+1} \mid O_{t+1}$]


Computation and communication
happen sequentially!





Overlap Computation and Communication 


¢ Note the following independence
¢ The send-out operation $o_l$ is independent of the backward operations


¢ The read-in operation $i_l$ can update the parameters of layer $l$ as soon as
$b_l$ has finished, without blocking the subsequent backward operations
$b_i$ ($i < l$)
¢ Idea: overlap computation and communication by utilizing
concurrency
¢ Pipelining the updates and computation operations


$$\theta^{(t+1)} = \theta^{(t)} + \varepsilon \sum_{p=1}^{P} \nabla_{\mathcal{L}}\left(\theta^{(t)}, D_p^{(t)}\right)$$
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WFBP: Wait-free backpropagation 


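¢ Idea: overlap computation and communication by utilizing concurrency
¢ Pipelining the updates and computation operations


[Figure: the communication of each layer is rescheduled to start right after its backward step $b_l$, instead of after the whole backward pass]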
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WFBP: Wait-free backpropagation 


¢ Idea: overlap computation and communication by utilizing
concurrency
¢ Communication overhead is hidden under computation
¢ Result: more computation per unit time


[Figure: pipelining]
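The benefit of pipelining can be estimated with a small timing model. The sketch below is an illustration with assumed numbers, not a measurement: `compute` and `comm` are hypothetical per-layer backward and synchronization times, listed in the order the backward pass visits the layers (top layer first), and all synchronization is assumed to share one serial network link.

```python
def sequential_time(compute, comm):
    """All backward steps first, then all communication."""
    return sum(compute) + sum(comm)

def wfbp_time(compute, comm):
    """Each layer's sync starts as soon as its backward step is done
    and the (single, serial) link is free."""
    t_compute = 0.0
    t_link = 0.0
    for c, m in zip(compute, comm):
        t_compute += c                    # backward step of this layer finishes
        start = max(t_compute, t_link)    # wait for the link if it is busy
        t_link = start + m
    return max(t_compute, t_link)

# Top (FC-like) layers: cheap to compute, expensive to send.
# Bottom (CONV-like) layers: expensive to compute, cheap to send.
compute = [1, 1, 4, 4]
comm = [4, 4, 1, 1]

t_seq = sequential_time(compute, comm)
t_wfbp = wfbp_time(compute, comm)

# If communication dominates, overlapping helps less.
t_wfbp_slow_link = wfbp_time(compute, [10, 10, 1, 1])
```

In this toy setting the communication of the top layers is hidden under the computation of the bottom layers, but once the link is slow enough the iteration time is bounded by communication rather than computation.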
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WFBP: Distributed Wait-free backpropagation


¢ How does WFBP perform?
¢ Using Caffe as an engine:


[Figure: scalability vs. # of nodes for GoogLeNet, VGG19 and VGG19-22K (40 GbE); curves: Linear, Poseidon, Caffe+WFBP, Caffe+PS]


¢ Using TensorFlow as an engine:


[Figure: scalability vs. # of nodes for Inception-V3 and VGG19 (40 GbE)]


Zhang et al. 2017
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WFBP: Distributed Wait-free backpropagation 


¢ Observation: why DWBP would be effective


¢ More statistics of modern CNNs


Params/FLOP distribution of modern CNNs

| Parameters | CONV Layers (# / %) | FC Layers (# / %) |
| AlexNet | 2.3M / 3.75 | 59M / 96.25 |
| VGG-16 | 7.15M / 5.58 | 121.1M / 94.42 |

| FLOPs | CONV Layers (# / %) | FC Layers (# / %) |
| AlexNet | 1,352M / 92.0 | 117M / 8.0 |
| VGG-16 | 10,937M / 91.3 | 121.1M / 8.7 |


¢ About 90% of the computation happens at the bottom (CONV) layers
¢ About 90% of the communication happens at the top (FC) layers
¢ WFBP overlaps these two 90% shares
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WFBP: Wait-free Backpropagation


¢ Does overlapping communication and computation solve all the
problems?


¢ No: not when communication time is longer than computation time (see the figure below).


¢ Even if communication and computation are perfectly overlapped, how much
scalability can we achieve?


[Figure: single-node vs. distributed timeline, and scalability vs. # of nodes for VGG19 and VGG19-22K (40 GbE); curves: Linear, Poseidon, Caffe+WFBP, Caffe+PS; a gap to linear scaling remains]
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Address the communication bottleneck 


¢ Note a simple fact:
¢ Communication time may be reduced, but cannot be eliminated (of
course).
¢ Therefore, possible ideas to address the communication
bottleneck:


¢ Hide the communication time by overlapping it with the computation
time, as described before.


¢ Reduce the size of the messages that need to be communicated
¢ Without compromising statistical convergence
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Introducing Sufficient Factor Broadcasting 


¢ Matrix-parametrized models


¢ Multiclass logistic regression: one matrix dimension is the feature dim.
¢ Sparse coding: feature dim. by dictionary size
¢ Distance metric learning: one matrix dimension is the feature dim.
¢ Neural network: #neurons in layer $l-1$ by #neurons in layer $l$
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Distributed Learning of MPMs 


¢ Learning MPMs (matrix-parametrized models) by communicating parameter matrices between server
and workers
¢ Dean and Ghemawat, 2008; Dean et al., 2012; Sindhwani and Ghoting, 2012; Gopal
and Yang, 2013; Chilimbi et al., 2014; Li et al., 2015


¢ High communication cost and large synchronization delays


¢ Multiclass logistic regression: feature dim. = 20K, #classes = 325K
¢ Neural network (AlexNet): #neurons in layer fc6 = 4096, #neurons in layer fc7 = 4096
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Sufficient Factor (SF) Updates


The full parameter matrix update $\Delta W$ can be computed as the outer product of two
vectors, $\Delta W = u v^\top$ (called sufficient factors)
¢ Example: primal stochastic gradient descent (SGD)


$$\min_W \frac{1}{N} \sum_{i=1}^{N} f_i(W a_i; b_i) + h(W)$$


¢ Example: stochastic dual coordinate ascent (SDCA)


$$\min_Z \frac{1}{N} \sum_{i=1}^{N} f_i^*(-z_i) + h^*\left(\frac{1}{N} Z A^\top\right)$$


Send lightweight SF updates $(u, v)$ instead of expensive full-matrix $\Delta W$ updates!
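The outer-product structure can be verified numerically for one concrete case. The sketch below assumes multiclass logistic regression with a softmax cross-entropy loss on a single sample; the sizes `J`, `K` and the helper names are hypothetical choices for the illustration. For that loss the gradient with respect to $W$ is $u v^\top$ with $u = \mathrm{softmax}(W a) - e_b$ and $v = a$.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
J, K = 10, 50                # assumed sizes: #classes x feature dim
W = rng.normal(size=(J, K))
a = rng.normal(size=K)       # one training sample
b = 3                        # its class label

def loss(W_):
    """Cross-entropy loss of the sample (a, b) under parameters W_."""
    return -np.log(softmax(W_ @ a)[b])

# Sufficient factors: two vectors of sizes J and K.
u = softmax(W @ a)
u[b] -= 1.0
v = a

# The full J x K gradient is recovered as their outer product.
grad_from_sf = np.outer(u, v)

floats_sent_sf = u.size + v.size       # what SFB would transmit
floats_sent_full = grad_from_sf.size   # what a full-matrix update would transmit
```

Here the sufficient factors take 60 floats while the full update takes 500; the gap widens as the matrix grows, but shrinks again when many samples (and hence many factor pairs) are sent per update.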
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Sufficient Factor Broadcasting:
P2P Topology + SF Updates


[Figure: workers connected peer-to-peer, each broadcasting its sufficient factors $(u_p, v_p)$ to the others]


A computing & communication tradeoff


¢ Full update: training examples → aggregated update matrix


¢ Pre-update: training examples → sufficient vectors $(u_1, v_1), (u_2, v_2), (u_3, v_3), \ldots$


¢ Stochastic algorithms
¢ Mini-batch: $C$ samples
¢ Matrix representation: $O(JK)$ (for a $J \times K$ matrix)
¢ SV representation: $O((J + K)C)$
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Synchronization of Parameter Replicas 


¢ Parameter server: model replicas 1, 2 and 3 synchronize through the shared state $W$ on the server


¢ A cost comparison


¢ Transfer SVs instead of $\Delta W$: each replica $p$ sends $(u_p, v_p)$ to the other replicas, which reconstruct $\Delta W_p = u_p v_p^\top$ locally and add the received updates to their own copy of $W$
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Convergence Speedup


[Figure: convergence time (hours) for Sparse Coding (SC), Multiclass Logistic Regression (MLR) and Distance Metric Learning (DML)]


¢ 3 benchmark ML programs
¢ Big parameter matrices with 6.5-8.6b entries (830+GB), running on 12- & 28-machine clusters


¢ 28-machine SFB finished in 2-7 hours
¢ Up to 5.6x faster than 28-machine PS, 12.3x faster than 28-machine Spark


¢ PS cannot support SF communication, which requires decentralized
storage
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Convergence Guarantee


¢ Assumptions
¢ Bridging model


¢ Stale Synchronous Parallel (SSP) with staleness
parameter $s$


¢ Bulk Synchronous Parallel is a special case of SSP with
$s = 0$


¢ Communication methods


¢ Partial broadcast (PB): sending messages to a subset of
$Q$ ($Q < P - 1$) machines


Assumption 1. (1) For all $j$, $f_j$ is continuously differentiable and $F$ is bounded from below; (2)
$\nabla F$, $\nabla F_p$ are Lipschitz continuous with constants $L_F$ and $L_p$, respectively, and let $L = \sum_{p=1}^{P} L_p$;
(3) There exist $G, \sigma^2$ such that for all $p$ and $c$, we have (almost surely) $\|U_p(W_p^c, I_p^c)\| \le G\eta$ and
$\mathbb{E}\left\| |S_p| \sum_{j \in I_p} \nabla f_j(W) - \nabla F_p(W) \right\|_2^2 \le \sigma^2$

Convergence Guarantee


¢ Results


Theorem 1. Let Assumption 1 hold, and let $\{W_p^c\}$, $p = 1, \ldots, P$, and $\{W^c\}$ be the local sequences
and the auxiliary sequence, respectively.


Under full broadcasting (i.e., $Q = P - 1$), with the learning rate set to $\eta := \eta_c = O\left(\sqrt{\frac{1}{L \sigma^2 P s c}}\right)$, we
have


¢ $\liminf_{c \to \infty} \mathbb{E}\|\nabla F(W^c)\| = 0$, hence there exists a subsequence of $\nabla F(W^c)$ that almost surely
vanishes;

¢ $\lim_{c \to \infty} \max_p \|W^c - W_p^c\| = 0$, i.e. the maximal disagreement between all local sequences and
the auxiliary sequence converges to 0 (almost surely);

¢ There exists a common subsequence of $\{W_p^c\}$ and $\{W^c\}$ that converges almost surely to a stationary
point of $F$, with the rate $\min_{c \le C} \mathbb{E}\left\|\sum_{p=1}^{P} \nabla F_p(W_p^c)\right\|_2^2 \le O\left(\sqrt{\frac{L \sigma^2 P s}{C}}\right)$.


Under partial broadcasting (i.e., $Q < P - 1$), with a constant learning rate $\eta = \frac{1}{C L G (P - Q)}$,
where $C$ is the total number of iterations, we have

$$\min_{c \le C} \mathbb{E}\left\|\sum_{p=1}^{P} \nabla F_p(W_p^c)\right\|_2^2 \le O\left(L G (P - Q) + \frac{P (s G + \sigma^2)}{C G (P - Q)}\right)$$

Hence, the algorithm converges to an $O(LG(P - Q))$ neighbourhood if $C \to \infty$.
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Convergence Guarantee


¢ Take-home message:

¢ Under full broadcasting, given a properly chosen
learning rate, all local worker parameters $W_p$
eventually converge to stationary points (e.g. local
minima) of the objective function, despite the fact
that SV transmission can be delayed by up to $s$
iterations.

¢ Under partial broadcasting, the algorithm
converges to an $O(LG(P - Q))$ neighbourhood if
$C \to \infty$.
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Parameter Storage and Communication Paradigms


[Figure: centralized storage (workers send changes $\Delta W$ to a server, which sends $W$ itself back) vs. decentralized storage (workers send changes $\Delta W$ to each other)]


¢ Centralized: send parameter $W$ itself from server to worker
¢ Advantage: allows compact comms topology, e.g. bipartite
¢ Decentralized: always send changes $\Delta W$ between workers
¢ Advantage: more robust, homogeneous code, low communication (?)
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Topologies:
Master-Slave versus P2P?


[Figure: a master-slave topology and a P2P topology connecting the workers]


Master-slave
¢ Used with centralized storage paradigm
¢ Disadvantage: need to code/manage clients and servers separately
¢ Advantage: bipartite topology is comms-efficient
¢ Popular for Parameter Servers: Yahoo LDA, Google DistBelief, Petuum PS, Project Adam, Li&Smola PS, ...


P2P
¢ Used with decentralized storage
¢ Disadvantage (?): high comms volume for large # of workers
¢ Advantage: same code for all workers; no single point of failure, high elasticity to resource adjustment
¢ Less well-explored due to perception of high communication overhead?
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Hybrid Updates: PS + SFB 


¢ Hybrid communications:
Parameter Server +
Sufficient Factor
Broadcasting
¢ Parameter Server: master-slave topology
¢ Sufficient factor broadcasting: P2P topology


¢ For problems with a mix of
large and small matrices,
¢ Send small matrices via PS
¢ Send large matrices via SFB


Hybrid example: CNN 


Hao Zhang, Zhiting Hu, Jinliang Wei, Pengtao Xie, Gunhee Kim, Qirong Ho, Eric P. Xing. Poseidon: A 
System Architecture for Efficient GPU-based Deep Learning on Multiple Machines. USENIX ATC 2016. 


¢ Example: AlexNet CNN model
¢ Final layers = 4096 * 30000 matrix (120M parameters)
¢ Use SFB to communicate
¢ 1. Decouple into two vectors $u$, $v$ (of lengths 4096 and 30000)
¢ 2. Transmit the two vectors
¢ 3. Reconstruct the gradient matrix


Figure from
Krizhevsky et al. 2012
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Hybrid example: CNN 


Hao Zhang, Zhiting Hu, Jinliang Wei, Pengtao Xie, Gunhee Kim, Qirong Ho, Eric P. Xing. Poseidon: A 
System Architecture for Efficient GPU-based Deep Learning on Multiple Machines. USENIX ATC 2016. 


¢ Example: AlexNet CNN model 
¢ Convolutional layers = e.g. 11 * 11 matrix (121 parameters) 


¢ Use full-matrix updates to communicate
¢ 1. Send/receive using the master-slave PS topology


Figure from 
Krizhevsky et al. 2012 
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Hybrid Communication 


¢ Idea
¢ Sync FC layers using SFB
¢ Sync CONV layers using PS


¢ Effectiveness


¢ It directly reduces the size
of messages in many
situations


¢ Is SFB always optimal?


¢ No, its total communication
load increases
quadratically with the number of workers
¢ The right strategy: choose
whichever of PS and SFB results in
less communication
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Hybrid Communication 


¢ A best-of-both-worlds strategy


¢ For example, AlexNet parameters between FC6 and FC7
¢ Tradeoff between PS and SFB communication


[Figure: communication cost of the compared strategies, including PS and SFB, (a) against # of nodes and (b) against batch size]


Hybrid Communication 


¢ How to choose? Where is the threshold?


¢ Determine the best strategy depending on
¢ Layer type: CONV or FC?
¢ Layer size
¢ Batch size
¢ # of cluster nodes


Table 1: Estimated communication cost of PS, SFB and Adam
for synchronizing the parameters of an M x N FC layer on a cluster
with P1 workers and P2 servers, when the batch size is K.
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Hybrid Communication 


¢ Hybrid communication algorithm


Algorithm 1 Get the best comm method of layer l

function BESTSCHEME(l)
    layer_property = Query(l.name)
    P1, P2, K = Query('n_worker', 'n_server', 'batchsize')
    if layer_property.type == 'FC' then
        M = layer_property.width
        N = layer_property.height
        if 2K(P1 - 1)(M + N) <= 2MN(P1 + P2 - 2)/P2 then
            return 'SFB'
        end if
    end if
    return 'PS'
end function


Determine the best strategy depending on
¢ Layer type: CONV or FC?
¢ Layer size: M, N
¢ Batch size: K
¢ # of cluster nodes: P1, P2
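The selection rule can be written as a small function. This is a sketch under an explicit assumption: the per-worker cost estimates are taken to be $2K(P_1 - 1)(M + N)$ for SFB and $2MN(P_1 + P_2 - 2)/P_2$ for PS, which is a reading of the garbled inequality in Algorithm 1 and should be checked against the Poseidon paper. The function and argument names are hypothetical and are not Poseidon's API.

```python
def sfb_cost(M, N, K, P1):
    """Assumed floats moved per worker when broadcasting sufficient factors."""
    return 2 * K * (P1 - 1) * (M + N)

def ps_cost(M, N, P1, P2):
    """Assumed floats moved per node when syncing full matrices through a PS."""
    return 2 * M * N * (P1 + P2 - 2) / P2

def best_scheme(layer_type, M, N, K, P1, P2):
    """Pick 'SFB' for FC layers when it moves no more data than 'PS'."""
    if layer_type == "FC" and sfb_cost(M, N, K, P1) <= ps_cost(M, N, P1, P2):
        return "SFB"
    return "PS"

# AlexNet-like FC layer (4096 x 4096), small batch, small cluster.
small = best_scheme("FC", 4096, 4096, K=32, P1=8, P2=8)

# Same layer with a large batch on a large cluster.
large = best_scheme("FC", 4096, 4096, K=512, P1=64, P2=64)

# Convolutional layers always go through the parameter server.
conv = best_scheme("CONV", 11, 11, K=32, P1=8, P2=8)
```

With these assumed costs the small setting favours SFB, while the large batch and large cluster push the same layer back to PS, which matches the remark that SFB is not always optimal.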
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Hybrid Communication 


¢ Results: achieves linear scalability across different models/data with 40 GbE bandwidth
¢ Using Caffe as an engine:


[Figure: scalability vs. # of nodes for GoogLeNet, VGG19 and VGG19-22K (40 GbE); curves: Linear, Poseidon, Caffe+WFBP, Caffe+PS]


¢ Using TensorFlow as an engine:


[Figure: scalability vs. # of nodes for Inception-V3 and VGG19 (40 GbE); Poseidon improves over WFBP]


Zhang et al. 2015; Zhang et al. 2017
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Hybrid Communication 


¢ Linear scalability on throughput, even with limited bandwidth!
¢ Makes distributed deep learning affordable


[Figure: throughput scaling vs. # of nodes for GoogLeNet (5M parameters), VGG19 (143M parameters) and VGG19-22K (229M parameters); curves: Linear, and Poseidon and Caffe+WFBP at several Ethernet bandwidths]


| Ethernet | Rate (Gbit/s) | Rate (MB/s) | Rate (# floats/s) |
| 1 GbE | 1 | 125 | 31.25M |
| 10 GbE | 10 | 1250 | 312.5M |
| InfiniBand | 40 | 5000 | 1250M |
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Hybrid Communication 


¢ Discussion: utilizing SFs is actually not a new idea
¢ Microsoft Adam uses the third strategy (c)


[Figure: three strategies: (a) centralized, matrices; (b) SFB; (c) server, matrices + SFs]


¢ Push: SFs
¢ Pull: full matrices
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Hybrid Communication 


¢ Adam's strategy leads to a communication bottleneck
¢ Pushing SFs to the server is fine


¢ Pulling full matrices back will create a bottleneck on the server node.


[Figure: traffic (Gb/iter) per node for TF-WFBP, Adam and Poseidon]
Figure 10: Averaged communication load when training
VGG19 using TF-WFBP, Adam and Poseidon with TensorFlow
engine. Each bar represents the network traffic on a node.


¢ Hybrid communication yields communication load balancing
¢ This is important for addressing the problem of burst communication.
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Introducing Poseidon


¢ Poseidon: An efficient communication architecture
¢ A distributed platform to amplify existing DL toolkits


[Figure: DL toolkits (Caffe, Caffe2, TensorFlow, CNTK, Torch, PyTorch, DyNet, Theano, Chainer) sitting on top of the platform]
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Poseidon's position


¢ Design principles
¢ Efficient distributed platform for amplifying any DL toolkit
¢ Preserve the programming interface of any high-level toolkit
¢ I.e. distribute the DL program without changing any line of code
¢ Easy deployment, easy adoption.
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Poseidon System Architecture


[Figure: data flow, instruction and allocation arrows between the GPU/CPU side (stream pool, thread pool, syncers with SFB), the coordinator and the KV store]
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Poseidon APIs


¢ KV Store, Syncer and Coordinator


¢ Standard APIs similar to parameter server
¢ Push/Pull API for parameter synchronization
¢ BestScheme method to return the best communication method


| Method | Owner | Arguments | Description |
| BestScheme | Coordinator | A layer name or index | Get the best communication scheme of a layer |
| Query | Coordinator | A list of property names | Query information from coordinators' information book |
| Send | Syncer | None | Send out the parameter updates of the corresponding layer |
| Receive | Syncer | | Receive updates from either parameter server or peer workers |
| Move | Syncer | A GPU stream and an indicator of move direction | Move contents between GPU and CPU, do transformations and application of updates if needed |
| | | | Receive gradient updates from workers |
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Amplify DL toolboxes Using Poseidon


¢ For developers: plug the Poseidon API into the backpropagation
code. All you need to do is:
¢ Backpropagate through layer l
¢ Sync parameters of layer l
¢ Wait for finishing


¢ Amplifying Google TensorFlow or Caffe takes at most a few hundred lines of code


Algorithm 2 Parallelize a DL library using Poseidon

function TRAIN(net)
    for iter = 1 → T do
        sync_count = 0
        net.Forward()
        for l = L → 1 do
            net.BackwardThrough(l)
            thread_pool.Schedule(sync(l))
        end for
        wait_until(sync_count == net.num_layers)
    end for
end function

function SYNC(l)
    stream = stream_pool.Allocate()
    syncers[l].Move(stream, GPU2CPU)
    syncers[l].method = coordinator.BestScheme(l)
    syncers[l].Send()
    syncers[l].Receive()
    syncers[l].Move(stream, CPU2GPU)
    sync_count++
end function
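The control flow of Algorithm 2 can be imitated with a standard thread pool. The sketch below is a toy under stated assumptions: `ToyNet`, `sync` and `train` are hypothetical stand-ins, no GPU or network is involved, and waiting on futures replaces the `sync_count` counter. It only shows the scheduling pattern (sync each layer as soon as its backward step is done, then wait for all layers), not Poseidon's actual implementation.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class ToyNet:
    """Stand-in for a DL toolkit's network object (hypothetical)."""

    def __init__(self, num_layers):
        self.num_layers = num_layers
        self.log = []
        self._lock = threading.Lock()

    def record(self, kind, layer):
        with self._lock:
            self.log.append((kind, layer))

    def forward(self):
        self.record("forward", 0)

    def backward_through(self, layer):
        self.record("backward", layer)

def sync(net, layer):
    # In a real system the Move / Send / Receive / Move steps would happen here.
    net.record("sync", layer)

def train(net, iterations, pool):
    for _ in range(iterations):
        net.forward()
        pending = []
        for layer in range(net.num_layers, 0, -1):
            net.backward_through(layer)
            # Hand the layer to the pool and continue with the next layer.
            pending.append(pool.submit(sync, net, layer))
        for future in pending:
            future.result()  # wait until every layer has been synchronized

net = ToyNet(num_layers=4)
with ThreadPoolExecutor(max_workers=2) as pool:
    train(net, iterations=2, pool=pool)
```

Every layer's sync is recorded after its own backward step, while syncs of upper layers are free to run concurrently with the backward steps of lower layers.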
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Using Poseidon


¢ Poseidon: An efficient communication architecture


¢ Preserve the programming interface of any high-level toolkit
¢ I.e. distribute the DL program without changing any line of application code


[Figure: DL toolkits (Caffe, Caffe2, TensorFlow, CNTK, Torch, DyNet, Theano) on top of the platform]
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Outline


¢ Overview: Distributed Deep Learning on GPUs
¢ Challenge 1: Addressing the communication bottleneck
¢ Challenge 2: Handling the limited GPU memory
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What is the Issue


¢ Memory
¢ GPUs have dedicated memory


¢ For a DL training program to be efficient, its data must be placed in
GPU memory


¢ GPU memory is limited compared to CPU memory, e.g. 12 GB at most
¢ Memcpy between CPU and GPU is expensive: a memcpy takes the
same time as launching a GPU computation kernel
¢ Problems to be answered
¢ How to avoid memcpy overhead between CPU and GPU?


¢ How to train a gigantic network with very limited
available memory?
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A Machine w/o GPU


[Figure: a machine with a NIC (network), CPU cores and DRAM (CPU memory)]
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A Machine w/ GPU


[Figure: a machine with a NIC (network), CPU cores, DRAM (CPU memory) and a GPU device with GPU cores and GPU memory]


¢ Small GPU memory
¢ Expensive to copy between GPU/CPU memory
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Machine Learning on GPU


[Figure: a mini-batch of training data is staged from CPU memory into GPU memory]
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Deep Learning on GPU


[Figure: a network mapping a training batch to class probabilities; GPU memory holds the training batch, the parameters and the intermediate states]
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Numbers


Max available GPU memory: 12G


| Network | Batch size | Training batch | Parameters + grads | Intermediate states |
| AlexNet | | 150MB | <500M | 4.5G |
| GoogLeNet | 64 | 19MB | <40M | 10G |
| VGG19 | 16 | 10MB | <1.2G | 10.8G |
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Why Memory Is an Issue?


¢ Intermediate states occupy 90% of the GPU memory
¢ Intermediate states are proportional to the input batch size


¢ However,


¢ If you want high throughput, you must have a large batch size (because
of the SIMD nature of GPUs)


¢ If you have a large batch size, your GPU memory will be occupied by
intermediate states, which limits your model size/depth
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Saving Memory: A Simple Trick


¢ Basic idea
¢ The fact: intermediate states are proportional to the batch size K


¢ Idea: achieve a large batch size by accumulating gradients generated from smaller batches
that fit in GPU memory


¢ Solution:
¢ Partition K into M parts; every part has K/M samples
¢ For iter = 1:M


¢ Train with mini-batch size K/M
¢ Accumulate the gradient on GPU without updating model parameters
¢ Update the model parameters all together when all M parts are finished


¢ Drawbacks
¢ What if the GPU still cannot afford the intermediate states even if K=1?
¢ A small batch size usually leads to insufficient use of the GPU's computational capability
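The accumulation trick can be illustrated on the CPU with NumPy. This is a minimal sketch under assumptions: the least-squares loss, the sizes `K` and `M`, and the variable names are chosen for the example, and `peak_rows` is only a crude proxy for the memory taken by intermediate states.

```python
import numpy as np

def grad(theta, X, y):
    """Gradient of the summed squared error 0.5 * ||X @ theta - y||^2."""
    return X.T @ (X @ theta - y)

rng = np.random.default_rng(1)
K, M = 32, 4                       # target batch size and number of parts
X = rng.normal(size=(K, 6))
y = rng.normal(size=K)
theta = rng.normal(size=6)

accumulated = np.zeros_like(theta)
peak_rows = 0                      # largest chunk held "in memory" at once
for X_part, y_part in zip(np.array_split(X, M), np.array_split(y, M)):
    accumulated += grad(theta, X_part, y_part)   # no parameter update here
    peak_rows = max(peak_rows, len(X_part))

# A single update once all M parts have been processed.
theta_new = theta - 0.01 * accumulated
```

The accumulated gradient matches the gradient of the full batch of K samples, while only K/M samples are processed at any one time.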
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Memory Management using CPU Memory 


¢ Core ideas
¢ If the memory is limited, trade something for memory
¢ Trade extra computation for memory
¢ Trade other costs (e.g. memory exchange) for more available memory
¢ If the memory is limited, then get more
¢ Model parallelism
¢ CPU memory
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Memory Management using CPU Memory 


[Figure: a layered network mapping training images (bottom) to class probabilities (top)]


¢ For each iteration (mini-batch)
¢ A forward pass
¢ Then a backward pass


¢ At each step, only the data of two
layers is used
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Memory Management using CPU Memory


The idea
¢ Use GPU memory as a cache to keep actively used data
¢ Store the remaining data in CPU memory
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Memory Management using CPU Memory


¢ CPU/GPU data transfer is very expensive, sometimes more
expensive than computation
¢ A controller/scheduler is used to alleviate/hide this
overhead


[Figure: CPU memory and GPU memory connected by CPU/GPU data transfer, with staging memory for input data on the GPU]
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Memory Management using CPU Memory 


¢ Controller
¢ The fact: the memory access order is deterministic and can be exactly known from a single forward and backward pass

¢ Idea:
¢ Obtain the memory access order by a virtual iteration
¢ Pre-fetch memory blocks from CPU to GPU
¢ Overlap memory swap overhead with computation
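
The controller idea can be sketched as follows. This is a simplified illustration, not the system's actual implementation: the block names (`act0`, `param0`, ...) and both functions are hypothetical, gradients are ignored, and each step is assumed to touch one input block, one parameter block, and one output block.

```python
def virtual_iteration(num_layers):
    """Dry run of one forward and one backward pass that only records
    which memory blocks each step touches, without any real computation."""
    order = []
    for i in range(num_layers):               # forward pass
        order.append((f"act{i}", f"param{i}", f"act{i + 1}"))
    for i in reversed(range(num_layers)):     # backward pass
        order.append((f"act{i + 1}", f"param{i}", f"act{i}"))
    return order


def prefetch_plan(order):
    """For each step, list the blocks that must be brought in from CPU
    memory, i.e. those not already held on the GPU by the previous step.
    A scheduler can issue plan[k + 1] while step k is still computing."""
    plan = []
    resident = set()
    for step in order:
        needed = set(step)
        plan.append(sorted(needed - resident))
        resident = needed  # assume all other blocks are swapped out
    return plan


order = virtual_iteration(3)
for step, fetch in zip(order, prefetch_plan(order)):
    print(step, "fetch:", fetch)
```

Because the access order is deterministic, the plan only needs to be computed once and can be reused for every mini-batch.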
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Memory Management using CPU Memory 


¢ What's the best we can do with this strategy?


¢ We only need 3 memory blocks (peak size) on GPU for:
¢ Input, Parameters, Output
¢ The whole training can proceed with ONLY these three blocks by
¢ Scheduling memcpy between CPU and GPU to be overlapped with computation


¢ Moving blocks in and out for each layer’s computation as training proceeds
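
The benefit of overlapping memcpy with computation can be estimated with a small timeline model. This is a sketch under stated assumptions: one copy engine, one compute engine, two buffers (double buffering), and hypothetical per-layer timings in milliseconds. The function names are illustrative.

```python
def serial_time(copy_ms, compute_ms):
    """No overlap: every layer waits for its copy, then computes."""
    return sum(copy_ms) + sum(compute_ms)


def overlapped_time(copy_ms, compute_ms):
    """Double buffering: the copy for layer i+1 runs while layer i computes.
    A copy for layer i may start once the previous copy has finished and
    the buffer last used by layer i-2 has been released."""
    copy_end, comp_end = [], []
    for i, (c, k) in enumerate(zip(copy_ms, compute_ms)):
        prev_copy = copy_end[i - 1] if i >= 1 else 0.0
        buffer_free = comp_end[i - 2] if i >= 2 else 0.0
        copy_end.append(max(prev_copy, buffer_free) + c)
        prev_comp = comp_end[i - 1] if i >= 1 else 0.0
        comp_end.append(max(copy_end[i], prev_comp) + k)
    return comp_end[-1] if comp_end else 0.0


copy_ms = [2, 2, 2, 2]      # hypothetical CPU-to-GPU transfer times
compute_ms = [5, 5, 5, 5]   # hypothetical per-layer compute times
print("serial:    ", serial_time(copy_ms, compute_ms), "ms")
print("overlapped:", overlapped_time(copy_ms, compute_ms), "ms")
```

When each copy is shorter than the computation it overlaps with, only the first transfer stays visible in the total time; when copies are longer than computation, the transfer becomes the bottleneck.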
[Figure: memory usage (1e9 scale) per neural network layer across the forward and backward passes, broken down into input data, intermediate states, and parameter data]




Throughput vs. memory budget

[Figure: training throughput vs. GPU memory per machine (GB). Annotations: all data in GPU memory; only buffer pool in GPU memory; twice the peak size for double buffering]


¢ Only 27% reduction in throughput with 35% of the memory
¢ Can do 3x bigger problems with little overhead 
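
A quick arithmetic check of these two numbers, assuming (this reading is not stated on the slide) that "3x bigger" follows from fitting the same problem into 35% of the GPU memory:

```python
memory_fraction = 0.35        # from the slide: 35% memory
throughput_reduction = 0.27   # from the slide: 27% reduction in throughput

# Assumption: if a problem fits in 35% of the memory, the full memory
# holds a problem about 1 / 0.35 times larger.
scale_up = 1 / memory_fraction
relative_throughput = 1 - throughput_reduction

print(f"problem size at the same memory budget: {scale_up:.2f}x")
print(f"throughput relative to all-in-GPU:      {relative_throughput:.2f}")
```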
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Larger models 
[Figure: connections trained/sec (1e12 scale) vs. model parameter size (GB), from 5 to 20 GB]



¢ Models up to 20 GB 
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Summary


¢ Deep learning as dataflow graphs 


¢ A lot of auto-differentiation libraries have been developed to train NNs 
¢ Different adoption, advantages, disadvantages 
¢ DyNet is a new framework for next-wave dynamic NNs 
¢ Difficulties arise when scaling up DL using distributed GPUs 
¢ Communication bottleneck 
¢ Memory limit 


¢ Poseidon as a platform to support and amplify different kinds of DL 
toolboxes 




Elements of Modern AI
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Sys-Alg Co-design Inside!

[Figure: stack of Model, Algorithm, Implementation, and System, labeled: Our "VML" software layer]
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Better Performance

¢ Fast and Real-Time
¢ Orders of magnitude faster than Spark and TensorFlow
¢ As fast as hand-crafted systems

[Figure: speedup vs. Spark by system (Spark, hand-crafted, PetuumOS); up to 200x faster on some ML algorithms]

¢ Any Scale
¢ Perfect straight-line speedup with more computing devices
¢ Spark, TensorFlow can slow down with more devices

[Figure: scaling with the number of GPU computers for Linear, Poseidon, and TensorFlow; up to 20x faster deep learning vs. TensorFlow]

¢ Low Resource
¢ Turning a regular cluster into a supercomputer:
¢ Achieve AI results with much more data, but using fewer computing devices
¢ Google Brain uses ~1000 machines whereas Petuum uses ~10 for the same job
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A Petuum Vision
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